Abstract


The  goal  of  filtering  theory  is  to  compute  the  filter  distribution,  that  is,  the  conditional 
distribution  of  a  stochastic  model  given  observed  data.  While  exact  computations  are 
rarely  possible,  sequential  Monte  Carlo  algorithms  known  as  particle  filters  have  been 
successfully  applied  to  approximate  the  filter  distribution,  providing  estimates  whose 
error  is  uniform  in  time.  However,  the  number  of  Monte  Carlo  samples  needed  to 
approximate  the  filter  distribution  is  typically  exponential  in  the  number  of  degrees 
of freedom of the model. This issue, known as the curse of dimensionality, has rendered
sequential  Monte  Carlo  algorithms  largely  useless  in  high-dimensional  applications 
such  as  multi-target  tracking,  weather  prediction,  and  oceanography.  While  over  the 
past  twenty  years  many  heuristics  have  been  suggested  to  run  particle  filters  in  high 
dimension,  no  principled  approach  has  ever  been  proposed  to  address  the  core  of  the 
problem. 

In this thesis we develop a novel framework to investigate high-dimensional filtering
models and to design algorithms that can avoid the curse of dimensionality. Using
concepts  and  tools  from  statistical  mechanics,  we  show  that  the  decay  of  correlations 
property  of  high-dimensional  models  can  be  exploited  by  implementing  localization 
procedures  on  ordinary  particle  filters  that  can  lead  to  estimates  whose  approximation 
error  is  uniform  both  in  time  and  in  the  model  dimension. 

Ergodic  and  spatial  mixing  properties  of  conditional  distributions  play  a  crucial 
role  in  the  design  of  filtering  algorithms,  and  they  are  of  independent  interest  in 
probability theory. To better capture ergodicity quantitatively, we develop new comparison
theorems to establish dimension-free bounds on high-dimensional probability
measures  in  terms  of  their  local  conditional  distributions.  At  a  qualitative  level,  we 
investigate  previously  unknown  phenomena  that  can  only  arise  from  conditioning  in 
infinite  dimension.  In  particular,  we  exhibit  the  first  known  example  of  a  model  where 
ergodicity  of  the  filter  undergoes  a  phase  transition  in  the  signal-to-noise  ratio. 
Chapter 1
Introduction 


1.1  Nonlinear  filtering  and  particle  filters 

A fundamental problem in a broad range of applications is the combination of observed
data and dynamical models. Particularly in highly complex systems with
partial observations, the effective extraction and utilization of the information contained
in observed data can only be accomplished by exploiting the availability of
accurate predictive models of the underlying dynamical phenomena of interest. Such
problems arise in applications that range from classical tracking problems in navigation
and robotics to extremely large-scale problems such as weather forecasting. In
the  latter  setting,  and  in  other  complex  applications  in  the  geophysical,  atmospheric 
and  ocean  sciences,  incorporating  observed  data  into  dynamical  models  is  called  data 
assimilation. 

From  a  statistical  perspective,  it  is  in  principle  simple  to  formulate  the  optimal 
solution  to  the  data  assimilation  problem.  We  model  the  dynamical  process  that  is  not 
directly observable as a time-homogeneous Markov chain $(X_n)_{n \ge 0}$ on a measurable
space $(E, \mathcal{E})$, with $\mathbf{P}(X_n \in dz \mid X_{n-1} = x) = p(x, z)\,\psi(dz)$, for a certain transition
density $p$ and reference measure $\psi$. We model the noisy observations $(Y_n)_{n \ge 0}$ as a
collection of random variables on a measurable space $(F, \mathcal{F})$ that are conditionally
independent given $(X_n)_{n \ge 0}$, with $\mathbf{P}(Y_n \in dz \mid X_n = x) = g(x, z)\,\varphi(dz)$, for a certain
observation density $g$ and reference measure $\varphi$. The joint process $(X_n, Y_n)_{n \ge 0}$ that
takes values in $(E \times F, \mathcal{E} \otimes \mathcal{F})$ is called a hidden Markov model, and its dependency
structure is illustrated in Figure 1.1.
Figure 1.1: Dependency graph of a hidden Markov model.

In many applications one is interested in estimating the hidden state $X_n$ based
on the observation history $Y_1, \dots, Y_n$ to date, and in computing $\mathbf{E}(f(X_n) \mid Y_1, \dots, Y_n)$
for a certain function $f$, or for a certain class of functions. For instance, we might
be  interested  in  tracking  the  position  of  a  boat  given  the  noisy  measurements  coming 
from  a  radar,  and  we  might  want  to  know  how  accurate  our  estimates  are.  Or  we 
might  be  interested  in  evaluating  the  temperature  field  of  the  weather  over  a  certain 
geographical  location  given  the  noisy  measurements  coming  from  weather  stations, 
together  with  the  uncertainty  in  our  estimates.  More  generally,  one  is  often  interested 
in  computing  the  conditional  mean  and  variance  of  the  underlying  process  given  the 
observation history.

However,  in  almost  all  cases  the  conditional  estimates  for  individual  functions  do 
not  form  a  closed  system  of  equations,  and  one  has  to  compute  the  nonlinear  filter 
distribution 

$$\pi_n := \mathbf{P}(X_n \in \cdot \mid Y_1, \dots, Y_n).$$

If the filter $\pi_n$ can be computed, it yields an optimal (in the least mean square
sense)  estimate  of  Xn  given  the  observations  up  to  time  n,  as  well  as  a  complete 
representation  of  the  uncertainty  in  this  estimate. 

An important property of the filter is its recursive structure: $\pi_n$ depends only on
$\pi_{n-1}$ and the new observation $Y_n$. In fact, it is easily verified using the Bayes formula
(cf. Section 3.3) that $\pi_n$ can be computed recursively in two steps, the so-called
prediction and correction steps:

$$\pi_{n-1} \xrightarrow{\text{prediction}} \mathsf{P}\pi_{n-1} \xrightarrow{\text{correction}} \pi_n = \mathsf{C}_n \mathsf{P}\pi_{n-1},$$

where $\mathsf{P}$ and $\mathsf{C}_n$ are, respectively, the prediction and correction operators that are
defined as

$$(\mathsf{P}\rho) f := \int \rho(dx)\, p(x, x')\, \psi(dx')\, f(x'),$$

$$(\mathsf{C}_n \rho) f := \frac{\int \rho(dx)\, g(x, Y_n)\, f(x)}{\int \rho(dx)\, g(x, Y_n)},$$

for any probability measure $\rho$ on $(E, \mathcal{E})$ and any measurable bounded function $f$.
When applied to the measure $\pi_{n-1}$, $\mathsf{P}$ uses the dynamics of the underlying Markov
chain to “predict” $X_n$ given the observation history $Y_1, \dots, Y_{n-1}$, namely,

$$\mathsf{P}\pi_{n-1} = \mathbf{P}(X_n \in \cdot \mid Y_1, \dots, Y_{n-1}).$$

Then, $\mathsf{C}_n$ “corrects” the predictive measure using the observation at time $n$, that is,
it weights the measure $\mathsf{P}\pi_{n-1}$ by the likelihood function $x \mapsto g(x, Y_n)$.
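
To make the two operators concrete, the following minimal sketch implements the prediction and correction steps when $E$ and $F$ are finite, so that measures are probability vectors and the densities $p$ and $g$ are matrices. The two-state model, its numerical values and the observation record are hypothetical and serve only as an illustration.

```python
def predict(rho, p):
    """Prediction operator: push the measure rho through the transition matrix p."""
    n = len(rho)
    return [sum(rho[x] * p[x][z] for x in range(n)) for z in range(n)]


def correct(rho, g, y):
    """Correction operator: reweight rho by the likelihood x -> g[x][y], then normalize."""
    weighted = [rho[x] * g[x][y] for x in range(len(rho))]
    total = sum(weighted)
    return [w / total for w in weighted]


def filter_step(pi_prev, p, g, y):
    """One step of the filter recursion: prediction followed by correction."""
    return correct(predict(pi_prev, p), g, y)


# Hypothetical two-state model (values chosen for illustration only).
p = [[0.9, 0.1], [0.1, 0.9]]   # p[x][z] = P(X_n = z | X_{n-1} = x)
g = [[0.8, 0.2], [0.2, 0.8]]   # g[x][y] = P(Y_n = y | X_n = x)

pi = [0.5, 0.5]                # assumed initial distribution of X_0
first_step = filter_step(pi, p, g, 0)
for y in [0, 0, 1, 0]:         # made-up observation record Y_1, ..., Y_4
    pi = filter_step(pi, p, g, y)
```

In this finite setting the recursion is exactly computable; the difficulty discussed in the text arises when the state space is too large for the measure to be stored and propagated in this way.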

The  recursive  nature  of  the  filter  plays  a  crucial  role  in  practice,  as  it  allows 
the  computations  to  be  implemented  on-line  over  a  long  time  horizon.  In  practice, 
however, the optimal filter is almost never directly computable: it requires the propagation
of an entire conditional distribution, which generally does not admit any
efficiently  computable  sufficient  statistics. 
The practical implementation of nonlinear filtering was therefore long considered
to be intractable until the discovery of a class of surprisingly efficient sequential Monte
Carlo algorithms, known as particle filters, for approximating the filter. The simplest
and most famous such algorithm is the sequential importance resampling (SIR) particle
filter introduced by Gordon, Salmond and Smith in 1993 [28]. This algorithm
simply inserts a random sampling step into the Bayes recursion and approximates the
filter $\pi_n$ by the resulting empirical measure $\hat\pi_n^N$. That is,

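$$\hat\pi_{n-1}^N \xrightarrow{\text{prediction}} \mathsf{P}\hat\pi_{n-1}^N \xrightarrow{\text{sampling}} \mathsf{S}^N \mathsf{P}\hat\pi_{n-1}^N \xrightarrow{\text{correction}} \hat\pi_n^N := \mathsf{C}_n \mathsf{S}^N \mathsf{P}\hat\pi_{n-1}^N,$$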


where $\mathsf{S}^N$ is the sampling operator that replaces whatever measure it is applied to
with its empirical measure with $N$ independent Monte Carlo samples or particles,
namely,

$$\mathsf{S}^N \rho := \frac{1}{N} \sum_{i=1}^{N} \delta_{X(i)}, \qquad X(1), \dots, X(N) \text{ are i.i.d. samples with distribution } \rho,$$

where $\delta_x$ denotes the Dirac measure with mass located at $x$. It is not difficult to
show that this gives rise to a standard Monte Carlo error (cf. Section 3.3.1)

$$\sup_{|f| \le 1} \mathbf{E}\,\bigl| \pi_n f - \hat\pi_n^N f \bigr| \le \frac{C}{\sqrt{N}},$$

which converges to zero as $N$ goes to infinity.
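
The prediction, sampling and correction steps can be sketched in a few lines. The code below is a minimal illustration rather than a reference implementation: the function names, the two-state model and its parameters are assumptions introduced here, and the weighted particle system plays the role of the empirical filter approximation.

```python
import random


def sir_step(particles, weights, sample_transition, likelihood, y, rng):
    """One SIR step: sample N particles from the predicted measure, then reweight."""
    n = len(particles)
    # Sampling from the predicted measure: pick an ancestor according to the
    # current weights, then move it with the hidden dynamics.
    ancestors = rng.choices(particles, weights=weights, k=n)
    new_particles = [sample_transition(x, rng) for x in ancestors]
    # Correction: weight each particle by the likelihood of y and normalize.
    raw = [likelihood(x, y) for x in new_particles]
    total = sum(raw)
    return new_particles, [w / total for w in raw]


# Hypothetical two-state model (values chosen for illustration only).
def sample_transition(x, rng):
    """Stay in the current state with probability 0.9, otherwise switch."""
    return x if rng.random() < 0.9 else 1 - x


def likelihood(x, y):
    """g(x, y): the observation equals the hidden state with probability 0.8."""
    return 0.8 if x == y else 0.2


rng = random.Random(0)
N = 20000
particles = [rng.randrange(2) for _ in range(N)]   # i.i.d. samples from a uniform initial law
weights = [1.0 / N] * N
particles, weights = sir_step(particles, weights, sample_transition, likelihood, 0, rng)

# Particle estimate of P(X_1 = 0 | Y_1 = 0); the exact filter gives 0.8 for this model.
estimate = sum(w for x, w in zip(particles, weights) if x == 0)
```

Resampling ancestors according to the weights and then propagating them is one way to draw i.i.d. samples from the predicted measure; the deviation of `estimate` from the exact value is of the order of the Monte Carlo rate discussed above.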


1.2  Filter  stability 

It turns out that in order to properly understand how the Monte Carlo approximation
of the filter recursion behaves, we need to understand the behavior of the filter distribution
itself. In fact, as shown in Section 3.3.1, a simple analysis that focuses on the
filter recursion $\pi_n = \mathsf{C}_n \mathsf{P}\pi_{n-1}$ alone would yield that the constant $C$ in the previous
bound grows exponentially with time $n$, which is what we would expect at first
as the SIR particle filter adds an approximation step (represented by the sampling
operator $\mathsf{S}^N$) to each iteration of Bayes formula. If the quality of the estimate given
by particle filters were really to deteriorate with time, then particle filters would be
totally useless in most practical applications, where one is interested in obtaining reliable
estimates at any time. However, a deeper analysis that also takes into account
the probabilistic structure of the filter distribution $\pi_n$ yields that the constant
$C$ in the previous bound does not depend on time, so that particle filters can indeed
function in an on-line fashion.

Del Moral and Guionnet in 2001 [  ] were the first to realize that the so-called
stability property of nonlinear filters can be used as a dissipation mechanism for the
approximation error of the SIR particle filter. Roughly speaking, filter stability says
that the filter forgets its initial condition as time goes on, something like

$$\mathbf{P}(X_n \in \cdot \mid X_0, Y_1, \dots, Y_n) \approx \mathbf{P}(X_n \in \cdot \mid Y_1, \dots, Y_n) \quad \text{for } n \text{ large enough.}$$
This property represents a weak form of conditional independence in time: as
$n$ increases, $X_n$ becomes “close” to being conditionally independent of $X_0$ given the
observation history $Y_1, \dots, Y_n$ (different notions of “closeness” are considered in this
thesis, cf. Chapter 3 and Chapter 7).

From  a  practical  perspective,  the  fact  that  the  filter  is  insensitive  to  the  knowledge 
of  the  initial  condition  can  be  exploited  to  prove  that  approximation  errors  committed 
by  particle  filters  at  each  time  step  do  not  accumulate  over  time.  It  turns  out  that 
the  sampling  step  introduced  at  each  iteration  of  Bayes  formula  is  precisely  the  key 
mechanism  that  allows  particle  filters  to  exploit  filter  stability  and  to  yield  time- 
uniform  error  bounds. 

As  the  error  they  commit  is  uniform  in  time,  particle  filters  have  proved  to  perform 
extraordinarily  well  in  many  classical  applications  such  as  target  tracking,  speech 
recognition,  and  finance  [  ]. 


1.3  Curse  of  dimensionality 

Despite their widespread success, particle filters have nonetheless proved to be essentially
useless in truly complex data assimilation problems. The reason for this, long
known to practitioners, has only recently been subjected to mathematical analysis in
the work of Bickel et al. [  1, 47]. Roughly speaking, the constant $C$ in the above bound,
while independent of time $n$, must typically be exponential in the number of degrees
of freedom of the model. This curse of dimensionality does not affect most classical
tracking problems, where the dimension of the state space $E \times F$ where the model
$(X_n, Y_n)_{n \ge 0}$ lives is typically of order unity. If we want to track the location of a boat,
for instance, then we can take $E = \mathbb{R}^2$ (analogously, $F = \mathbb{R}^2$), which we interpret
as a two-dimensional space (as the motion of the boat has two degrees of freedom).
On the other hand, the curse of dimensionality becomes absolutely prohibitive in
large-scale data assimilation problems such as weather forecasting and oceanography,
where model dimensions of order $10^7$ are routinely encountered [1].

The curse of dimensionality of particle filters is a consequence of the general fact
that in high dimension probability measures tend to be mutually singular, that is, they tend
to put mass on different portions of the space. The problem appears even in a single
iteration of the SIR algorithm, and it is due to the correction step performed by
the operator $\mathsf{C}_n$: in high dimension, typical samples coming from a measure $\rho$ have
small likelihood under the measure $\mathsf{C}_n \rho$, as illustrated in Figure 1.2. Hence, in high
dimension already the first empirical measure has only a small fraction of particles that
meaningfully approximate the filter distribution $\pi_1$ (cf. Section 3.3.3).
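
The collapse of the weights can be observed numerically in a single correction step. The sketch below assumes a toy model that is not taken from the text: the predictive measure is standard Gaussian in dimension `d`, each coordinate is observed in independent standard Gaussian noise, and the quality of the weighted sample is summarized by the effective sample size, a common diagnostic that is used here only for illustration.

```python
import math
import random


def ess_fraction(d, n_particles, rng):
    """Effective sample size, as a fraction of N, after one correction step in dimension d."""
    # Assumed toy model: hidden state ~ N(0, I_d), observation = state + N(0, I_d) noise.
    truth = [rng.gauss(0.0, 1.0) for _ in range(d)]
    y = [t + rng.gauss(0.0, 1.0) for t in truth]
    log_w = []
    for _ in range(n_particles):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]   # sample from the predictive measure
        log_w.append(-0.5 * sum((yi - xi) ** 2 for xi, yi in zip(x, y)))
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    ess = sum(w) ** 2 / sum(wi ** 2 for wi in w)
    return ess / n_particles


rng = random.Random(0)
fractions = {d: ess_fraction(d, 500, rng) for d in (1, 10, 50)}
```

With a fixed number of particles, the fraction of particles that carry non-negligible weight shrinks rapidly as `d` grows, which is the single-step manifestation of the phenomenon described above.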

While  this  phenomenon  is  now  fairly  well  understood,  there  exists  no  rigorous 
approach  to  date  for  alleviating  this  problem  [3,  60].  Practical  data  assimilation  in 
high-dimensional models is therefore generally performed by means of ad-hoc algorithms,
frequently based on (questionable) Gaussian approximations, that possess
limited  theoretical  justification  [  1,  37,  1]. 
Figure 1.2: Illustration of the curse of dimensionality in a typical iteration of the
SIR particle filter. (a) Probability measures in low dimension. (b) Probability
measures in high dimension (low-dimensional representation). Each sample $X$ from
$\rho$ is represented by a blue ball whose size is proportional to the likelihood $g(X, Y_n)$.
In high dimension $\rho$ and $\mathsf{C}_n \rho$ tend to put mass on different portions of the space.
For this reason, in high dimension only a small fraction of the samples coming from
$\rho$ has a relevant likelihood with respect to the observation $Y_n$.

1.4  Fundamental  obstacle  in  high  dimension? 

One of the main contributions of this thesis is to show that there is no fundamental
obstacle to particle filtering in high dimension. We propose the first algorithm that
can avoid the curse of dimensionality, and we develop a general framework that encompasses
a novel philosophy behind filtering in high dimension. From a practical
point of view, the framework that we propose provides a principled approach to design
new algorithms for high-dimensional applications, where the current state of the
art relies exclusively on heuristics. From a theoretical point of view, it is the first time
that ideas and tools from statistical mechanics are shown to play a fundamental role
in the analysis of filtering models.

Before  discussing  the  key  elements  that  constitute  our  framework,  we  present  an 
example  that  illustrates  how  it  is  possible  to  overcome  the  curse  of  dimensionality  in 
a  trivial  setting.  This  example  sets  the  direction  to  follow  for  the  development  of  our 
theory. 

Let $V = \{1, \dots, d\}$ be a finite index set, and for each $v \in V$ let $(X_n^v, Y_n^v)_{n \ge 0}$
be a hidden Markov model of the type considered so far, which takes values
in a measurable space $(E^v \times F^v, \mathcal{E}^v \otimes \mathcal{F}^v)$. Assume that the chains forming this
collection are independent, and consider the hidden Markov model $(X_n, Y_n)_{n \ge 0}$ with
$X_n = (X_n^v)_{v \in V}$ and $Y_n = (Y_n^v)_{v \in V}$. This dependency structure is illustrated in Figure
1.3.

Figure 1.3: Dependency graph of a (trivial) high-dimensional filtering model.


This  model  clearly  defines  a  (trivial)  high-dimensional  model,  where  the  dimension 
is  d,  the  number  of  independent  chains  being  considered.  From  the  theory  of  Bickel 
et  al.  [1,  I  ]  we  know  that  the  SIR  particle  filter  fails  miserably  when  applied  to  this 
model,  requiring  a  number  of  particles  N  that  is  exponential  in  d.  However,  in  this 
case  one  can  surmount  this  problem  in  a  trivial  fashion:  as  each  of  the  coordinates  of 
the  high-dimensional  model  is  independent,  one  can  simply  run  an  independent  SIR 
filter  in  each  coordinate.  It  is  evident  that  the  local  error  of  this  algorithm  (that  is,  the 
error  of  the  marginal  of  the  filter  in  each  coordinate)  is,  by  construction,  independent 
of  the  model  dimension  d.  In  this  sense,  this  trivial  model  shows  that  it  is  indeed 
possible  to  filter  very  efficiently  regardless  of  the  ambient  dimension  (though  not  with 
the  SIR  particle  filter,  which  fails  spectacularly). 

In the literature there is a widespread belief that filtering in high dimension is
possible only if the high-dimensional model being considered lives on a low-dimensional
manifold (see [  ] for instance). The trivial example that we just considered, however,
clearly  contradicts  this  idea,  as  there  is  no  low-dimensional  structure:  as  the  chains 
are  independent,  the  global  dimension  is  the  full  model  dimension  d.  The  reason 
why  we  can  deal  efficiently  with  this  high-dimensional  system  is  the  fact  that  the 
model  is  locally  low-dimensional  (the  local  dimension  being  1,  as  each  coordinate  is 
completely  independent  from  the  others),  and  the  fact  that  we  are  interested  in  local 
errors  (marginals  of  the  filtering  distribution  on  spatial  regions  of  a  fixed  size),  as 
opposed  to  the  global  measure  of  error  usually  considered  in  the  literature  for  this 
type of problem.


1.5  Decay  of  correlations  and  localization 

While  the  trivial  model  previously  introduced  does  not  have  any  practical  relevance, 
we  would  like  to  extend  the  main  ideas  that  guided  our  discussion  in  that  case  to 
nontrivial models that are of genuine practical interest. Several fundamental questions
arise  immediately. 

1.  What  sort  of  filtering  models  are  natural  to  investigate  in  high  dimension? 

2.  What sort of mechanism might allow us to surmount the curse of dimensionality?

3.  How  can  such  a  mechanism  be  exploited  algorithmically? 

We aim to address each of these questions in this thesis. Here we provide an
informal discussion that is instrumental in describing the main contributions of our work.

1.  What  filtering  models  are  natural  to  investigate  in  high  dimension? 

The local algorithm proposed for the trivial model above (i.e., running the SIR
particle filter on each chain separately) was made possible because the components
of  that  model  are  truly  independent.  When  this  is  not  the  case,  we  cannot  run 
independent  particle  filters  in  each  dimension  as  all  the  dimensions  are  coupled  by 
the  dynamics  of  the  model.  We  must  therefore  introduce  a  general  class  of  nontrivial 
models  in  which  the  above  intuition  can  nonetheless  be  implemented. 

In  most  data  assimilation  problems,  the  high-dimensional  nature  of  the  model  is 
essentially  due  to  its  spatial  structure:  the  aim  of  the  problem  is  to  track  the  dynamics 
of  a  random  field  (for  example,  the  atmospheric  pressure  and  temperature  fields  in  the 
case  of  weather  forecasting).  We  therefore  take  as  a  starting  point  the  notion  that  the 
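coordinates $X_n^v$ ($v \in V$) of our hidden Markov model are indexed by a large graph
$G = (V, E)$ that represents the spatial degrees of freedom of the model. That is, we
consider the case $(E, \mathcal{E}) = \bigl(\prod_{v \in V} E^v, \bigotimes_{v \in V} \mathcal{E}^v\bigr)$ and $(F, \mathcal{F}) = \bigl(\prod_{v \in V} F^v, \bigotimes_{v \in V} \mathcal{F}^v\bigr)$.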
It  is  of  course  not  reasonable  to  expect  that  the  dynamics  at  each  spatial  location  is 
independent,  as  was  assumed  in  the  trivial  model  previously  discussed.  On  the  other 
hand,  dynamics  of  spatial  systems  is  typically  local  in  nature:  the  dynamics  at  a 
spatial  location  depends  only  on  the  states  at  locations  in  a  neighborhood.  Moreover, 
the  observations  are  typically  local  in  the  sense  that  (a  subset  of)  spatial  locations 
are observed independently. The dependency structure of this type of model is
illustrated  in  Figure  1.4. 

These  local  filtering  models  are  prototypical  of  a  broad  range  of  high-dimensional 
filtering  problems,  and  they  provide  the  basic  framework  for  our  main  result.  They 
arise naturally in numerous complex and large-scale applications, including percolation
models of disease spread or forest fires, freeway traffic flow models, probabilistic
models on networks and large-scale queueing systems, and various biological, ecological
and neural models. Moreover, local Markov processes of this type arise naturally
from  finite-difference  approximation  of  stochastic  partial  differential  equations,  and 
are  therefore  in  principle  applicable  to  a  diverse  set  of  data  assimilation  problems 
that  arise  in  areas  such  as  weather  forecasting,  oceanography,  and  geophysics  (cf. 
Section  4.4.4). 

Figure 1.4: Dependency graph of a high-dimensional filtering model of the type
considered in this thesis.

2.  What mechanism can allow us to surmount the curse of dimensionality?

While the law of the model at each spatial location is no longer independent as in the
trivial model of the previous section, large-scale interacting systems can nonetheless
exhibit an approximate version of this property: this is the decay of correlations
phenomenon that has been particularly well studied in statistical mechanics (see,
e.g., [  ]). Informally speaking, while the states $(X_n^v, Y_n^v)$ and $(X_n^w, Y_n^w)$ at two sites
$v, w \in V$ are probably quite strongly correlated when $v$ and $w$ are close together, one
might expect that $(X_n^v, Y_n^v)$ and $(X_n^w, Y_n^w)$ are nearly independent when $v$ and $w$ are
far apart as measured with respect to the natural distance $d$ in the graph $G$ (that is,
$d(v, w)$ is the length of the shortest path in $G$ between $v, w \in V$). The idea is that,
due to the decay of correlations, this type of model is also locally low-dimensional,
in the sense that the conditional distribution of each coordinate only needs to be
updated by observations in a neighborhood whose size is independent of the ambient
dimension. That is,

$$\mathbf{P}(X_n^v \in \cdot \mid Y_1, \dots, Y_n) \approx \mathbf{P}(X_n^v \in \cdot \mid Y_1^w, \dots, Y_n^w,\ d(v, w) \le b),$$

for $b$ large enough. Roughly speaking, the “local dimension” of the model is the
number of coordinates in a ball whose radius is the correlation length of the filtering
distribution.

3.  How  can  such  a  mechanism  be  exploited  algorithmically? 

Both  filter  stability  and  decay  of  correlations  are  probabilistic  properties  of  the  filter 
distribution  itself:  filter  stability  represents  a  weak  form  of  conditional  independence 
in  time,  and  the  decay  of  correlations  property  represents  a  weak  form  of  conditional 
independence  in  space  (model  dimension).  As  already  mentioned,  the  sampling  step 
added  to  the  original  filter  recursion  is  the  key  to  exploit  algorithmically  filter  stability 
and get particle filters that yield time-uniform error bounds. One of the main goals of
this  thesis  is  to  show  that  proper  forms  of  localization  of  the  filter  recursion  can  be 
used  to  exploit  algorithmically  the  decay  of  correlations  property  and  to  design  local 
particle  filters  that  yield  error  bounds  that  are  uniform  both  in  time  and  in  space. 
As mentioned above, the curse of dimensionality of particle filters is essentially due
to  the  fact  that  probability  measures  tend  to  be  singular  in  high  dimension.  However, 
while  this  is  definitely  what  happens  if  we  consider  a  high-dimensional  model  as  a 
whole,  if  the  decay  of  correlations  property  holds  then  it  should  be  possible  to  localize 
the  model  and  work  with  local  low-dimensional  portions  of  it.  As  the  problem  comes 
from  the  correction  step  of  the  filter  recursion,  what  really  matters  is  the  dimension 
of  the  observations  (cf.  Section  3.3.2),  and  it  makes  sense  to  introduce  a  localization 
step  immediately  before  the  operator  Cn  so  that  the  model  can  behave  as  “local”  for 
the  sake  of  likelihood-reweighting. 

A speculative back-of-the-envelope computation explains how this might work.
Due to the decay of correlations, the conditional distribution of the site $X_n^v$ given
the new observation $Y_n$ should not depend significantly on observations $Y_n^w$ at sites
$w$ distant from $v$. Suppose we can develop a local particle filtering algorithm that
at each site $v$ only uses observations in a local neighborhood $K$ of $v$ to update the
filtering distribution. As we have now restricted to observations in $K$, the sampling
error (the variance) at each site will be exponential only in $\operatorname{card} K$ rather than in the
full dimension $\operatorname{card} V$. On the other hand, the truncation to observations in $K$ is only
approximate: the decay of correlations property suggests that the bias introduced by
this truncation should decay exponentially in $\operatorname{diam} K$. Therefore,

$$\text{error} = \text{bias} + \text{variance} \sim e^{-\operatorname{diam} K} + \frac{e^{\operatorname{card} K}}{\sqrt{N}}.$$

If the size of the neighborhoods $K$ is chosen so as to optimize the error, then the
resulting algorithm is evidently consistent (with a slower convergence rate than the
standard $1/\sqrt{N}$ Monte Carlo rate: this is likely unavoidable in high dimension) with
an error bound that is independent of the model dimension $\operatorname{card} V$.
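
The tradeoff can be explored numerically. The sketch below is only a heuristic illustration under assumptions that are not part of the text: the graph is a path, the neighborhood of a site is an interval of radius `r` (so that its diameter is `2r` and its cardinality is `2r + 1`), and both exponential rates are set to one.

```python
import math


def error_proxy(radius, n_particles):
    """Heuristic bias + variance proxy for an interval of the given radius on a path graph."""
    diam = 2 * radius          # diameter of {v - r, ..., v + r}
    card = 2 * radius + 1      # number of sites in the neighborhood
    return math.exp(-diam) + math.exp(card) / math.sqrt(n_particles)


def best_radius(n_particles, max_radius=50):
    """Radius that minimizes the error proxy for a given number of particles."""
    return min(range(max_radius + 1), key=lambda r: error_proxy(r, n_particles))


# The optimal radius and the resulting error depend on N only, not on card V.
table = {n: (best_radius(n), error_proxy(best_radius(n), n)) for n in (10**2, 10**6, 10**12)}
```

In this toy computation the optimal radius grows roughly logarithmically in the number of particles, and the optimized error decreases with `N` at a rate slower than the standard Monte Carlo rate, in line with the remark above.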

So,  the  general  idea  of  local  particle  filters  is  that  one  should  introduce  a  spatial 
regularization  step  into  the  filtering  recursion  that  enables  local  sampling.  While 
these  regularizations  introduce  some  bias  to  ordinary  particle  filters,  they  largely 
reduce  their  variance,  and  it  is  exactly  the  bias-variance  tradeoff  that  emerges  that 
can  be  used  to  overcome  the  curse  of  dimensionality. 

In  this  thesis  we  develop  two  localization  procedures  that  aim  at  implementing 
this  idea. 


1.  Using  independence:  block  particle  filter 

The most natural way to localize the filter recursion is to marginalize it. In this thesis
we analyze the block particle filter that we define iteratively as:

$$\hat\pi_{n-1}^N \xrightarrow{\text{prediction}} \mathsf{P}\hat\pi_{n-1}^N \xrightarrow{\text{sampling}} \mathsf{S}^N \mathsf{P}\hat\pi_{n-1}^N \xrightarrow{\text{blocking}} \mathsf{B}\mathsf{S}^N \mathsf{P}\hat\pi_{n-1}^N \xrightarrow{\text{correction}} \hat\pi_n^N := \mathsf{C}_n \mathsf{B}\mathsf{S}^N \mathsf{P}\hat\pi_{n-1}^N,$$

where $\mathsf{B}$ is an operator that projects a measure to the product of its marginals over
a certain partition $\mathcal{K}$ of “blocks” of the index set $V$, that is,

$$\mathsf{B}\rho := \bigotimes_{K \in \mathcal{K}} \mathsf{B}^K \rho,$$

where for any measure $\rho$ on $(E, \mathcal{E})$ and $J \subseteq V$ we denote by $\mathsf{B}^J \rho$ the marginal of $\rho$
on $\bigl(\prod_{v \in J} E^v, \bigotimes_{v \in J} \mathcal{E}^v\bigr)$.

This  algorithm  captures  the  main  intuition  that  motivated  our  discussion  on  local 
algorithms: choosing $\mathcal{K} = \{\{v\} : v \in V\}$, the block particle filter reduces to applying
the SIR particle filter independently to each of the components constituting the
model  (which,  of  course,  introduces  a  bias  unless  the  components  are  independent). 

In  Chapter  4  we  show  that  this  local  particle  filter  surmounts  the  key  obstacle  in 
high  dimension  by  providing  local  estimates  that  are  uniform  in  time  and  that  do 
not  depend  on  the  ambient  dimension. 
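
On a particle system, the blocking and correction steps amount to reweighting and resampling each block separately. The following sketch illustrates this under an assumption stated in the code, namely that the likelihood factorizes over sites (local observations); the function names, the Gaussian site likelihood and the numerical values are hypothetical.

```python
import math
import random


def block_resample(particles, site_likelihood, y, blocks, rng):
    """Draw N samples from the blocked and corrected empirical measure of `particles`.

    Assumes the likelihood factorizes over sites, g(x, y) = prod_v g^v(x^v, y^v),
    so that each block can be reweighted and resampled on its own.
    """
    n = len(particles)
    new_particles = [list(x) for x in particles]
    for block in blocks:
        # Weight of each particle restricted to this block: product of site likelihoods.
        weights = [
            math.prod(site_likelihood(x[v], y[v]) for v in block) for x in particles
        ]
        # Resample the coordinates of this block independently of all other blocks.
        chosen = rng.choices(range(n), weights=weights, k=n)
        for i, j in enumerate(chosen):
            for v in block:
                new_particles[i][v] = particles[j][v]
    return new_particles


def gaussian_site_likelihood(x, y):
    """Hypothetical site likelihood: unit-variance Gaussian observation noise."""
    return math.exp(-0.5 * (y - x) ** 2)


rng = random.Random(0)
d, N = 6, 200
blocks = [[0, 1, 2], [3, 4, 5]]            # a partition of the index set {0, ..., 5}
particles = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(N)]
y = [0.5] * d                              # made-up observation
resampled = block_resample(particles, gaussian_site_likelihood, y, blocks, rng)
```

Because the weights of a block involve only the observations in that block, their variability is governed by the block size rather than by the full dimension `d`; the price is that dependence between different blocks is discarded at every step.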


2.  Using  conditional  independence:  localized  Gibbs  sampler  particle  filter 

The  block  particle  filter  possesses  some  inherent  limitations  as  it  can  only  provide 
spatially  inhomogeneous  approximations  of  the  filter  distribution.  In  fact,  at  each 
iteration  the  algorithm  projects  the  approximated  filter  measure  into  the  product  of 
its  marginals  over  a  given  (fixed)  partition  of  the  environment  space. 

In order to address this deficiency at a fundamental level, we consider a regularization
that aims at projecting probability measures to the class of Markov random
fields (of a certain interaction neighborhood), instead of projecting them to the class
of distributions that are independent across subsets of coordinates, as in the block
particle filter. This is precisely the idea that animates the localized Gibbs sampler
particle filter. Heuristically, this algorithm can be described as follows:


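$$\hat\pi_{n-1}^N \xrightarrow{\text{prediction}} \mathsf{P}\hat\pi_{n-1}^N \xrightarrow{\text{projection}} \mathsf{M}\mathsf{P}\hat\pi_{n-1}^N \xrightarrow{\text{correction}} \mathsf{C}_n \mathsf{M}\mathsf{P}\hat\pi_{n-1}^N \xrightarrow{\text{sampling}} \hat\pi_n^N := \mathsf{S}^N \mathsf{C}_n \mathsf{M}\mathsf{P}\hat\pi_{n-1}^N,$$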


where  M  is  an  operator  that  projects  a  measure  to  the  class  of  Markov  random  fields 
of  order  b,  that  is, 

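$$(\mathsf{M}\rho)(X^v \in A \mid X^{V \setminus \{v\}} = x^{V \setminus \{v\}}) = \rho(X^v \in A \mid X^{N_b(v) \setminus \{v\}} = x^{N_b(v) \setminus \{v\}})$$

for every $v \in V$, where $N_b(v) := \{v' \in V : d(v, v') \le b\}$.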

As  shown  in  Chapter  5,  by  sampling  locally  in  each  dimension  rather  than  globally 
over  all  dimensions,  the  localized  Gibbs  sampler  particle  filter  implements  a  sort  of 
“resampling  in  space.”  In  this  sense,  the  mechanism  through  which  this  algorithm 
exploits  the  decay  of  correlations  property  to  provide  spatially-uniform  error  bounds 
resembles  the  analogous  mechanism  that  allows  the  SIR  particle  filter  to  exploit  filter 
stability  and  provide  time-uniform  error  bounds. 

While  a  complete  analysis  of  this  algorithm  is  still  missing,  we  prove  a  one-step 
error  bound  for  the  bias  term  that  illustrates  the  way  this  algorithm  can  provide 
spatially-uniform  error  bounds. 


1.6  Comparison  theorems  for  Gibbs  measures 

At  this  point  it  is  not  at  all  clear  what  sort  of  mathematical  tools  are  needed  to  make 
the  speculative  ideas  discussed  so  far  precise.  In  fact,  the  rigorous  implementation 
of these ideas requires the introduction of a mathematical machinery that has not
previously  been  applied  in  the  study  of  nonlinear  filtering. 

As the trivial example previously introduced illustrates, in order to describe
filtering problems in high dimension effectively it is necessary to perform a local analysis:
we want to look at a local measure of the error, and we want to be able to perform
the analysis using local quantities of the model. As any approximation of practical
utility in high dimension must yield error bounds that do not grow, or at least grow
sufficiently slowly, in the model dimension $\operatorname{card} V$, we seek quantitative
methods that allow us to establish dimension-free bounds on high-dimensional probability
distributions.

A general method to address precisely this problem is the Dobrushin comparison
theorem, which was developed by Dobrushin in the context of statistical mechanics
[18, Theorem 3]. In the approach pioneered by Dobrushin, Lanford, and Ruelle, a
high-dimensional (possibly infinite) system of interacting random variables is defined
by its local description: for finite sets of sites $J \subseteq V$, the conditional distribution
$\rho(X^J \in \cdot \mid X^{V\setminus J} = x^{V\setminus J})$ of the configuration in $J$ is specified given that the variables
outside $J$ are frozen in a fixed configuration $x^{V\setminus J}$ (we write $x^K = (x^k)_{k\in K}$ for $K \subseteq V$).
The model $\rho$ is then defined as a probability measure (called a Gibbs measure) that
is compatible with the given system of local conditional distributions.

The  Dobrushin  comparison  theorem  is  a  tool  to  bound  the  total  variation  difference 
between  marginals  of  Gibbs  measures  in  terms  of  their  local  conditional  distributions. 
This tool is what allows us to characterize the crucial way in which the decay of
correlations property enters the local analysis of particle filters in our framework.

Despite  being  a  powerful  tool,  the  Dobrushin  comparison  theorem  requires  the 
validity  of  restrictive  assumptions,  and  for  most  models  this  fact  poses  a  major 
limitation  on  the  applicability  of  the  theorem. 


One of the contributions of this thesis is to develop a more general and flexible
machinery that allows us to get more powerful results. By relying on the
Dobrushin-Shlosman [  ] and Weitz [64] conditions for uniqueness of Gibbs measures, instead
of the Dobrushin condition employed by the original comparison theorem, the new
comparison theorems that we develop in Chapter 6 provide more flexible tools to
analyze the behavior of algorithms in larger regions of the natural parameter space,
and are of independent interest in statistical mechanics for the analysis of Gibbs
measures.

The  novel  toolbox  is  used  to  extend  qualitatively  the  analysis  of  the  block  particle 
filter  that  we  initially  obtain  using  the  original  Dobrushin  comparison  theorem. 

In  order  to  prove  the  new  comparison  theorems  we  develop  a  methodology  that 
exploits  the  connection  with  a  certain  type  of  Markov  chains  called  Gibbs  samplers. 
The general framework behind our proofs represents a novel contribution in the
connection between static Gibbs measures and dynamical Gibbs samplers.


1.7 Filtering in infinite dimension

Alongside the quantitative investigation of Gibbs measures and its connection
to filtering algorithms in high dimension, this thesis also deals with the qualitative
understanding of conditional ergodicity of Gibbs measures and its connection to the
theory of filtering in infinite dimension.

As previously discussed, both filter stability and decay of correlations are
probabilistic properties of the filter distribution that play a crucial role in the development
of particle filters. At first sight, both properties might also seem natural. If we take
filter stability, for instance, it is often the case that the underlying chain $(X_n)_{n\ge 0}$
itself is stable, namely,

\[
\mathbf{P}(X_n \in \cdot \mid X_0) \approx \mathbf{P}(X_n \in \cdot) \quad \text{for $n$ large enough},
\]

and it seems highly likely that if this is the case then the filter should also forget
its initial condition. Moreover, it seems natural that as time $n$ increases the
initial knowledge of $X_0$ is superseded by the information contained in the observations
$Y_1, \ldots, Y_n$, so that eventually $X_0$ does not affect the filter. However, neither of these
two intuitions is always true.

Understanding  the  general  assumptions  that  guarantee  the  inheritance  of  stability 
from  the  underlying  chain  to  the  filter  distribution  has  been  a  longstanding  problem 
dating  back  to  the  work  of  Blackwell  in  1957  [  ]  and  Kunita  in  1971  [33],  and  it 
is  related  to  many  areas  of  probability  theory,  far  beyond  the  algorithmic  setting 
considered  in  this  thesis  [63,  59]. 

A general qualitative theory that exhaustively characterizes this phenomenon was
developed by van Handel in 2009 [57], where it is shown that if the
Markov chain $(X_n, Y_n)_{n\ge 0}$ is stable (in a certain total variation sense), and if a mild
non-degeneracy condition holds for the density of the law of $Y_k$ given $X_k$ for each $k \ge 0$
(essentially requiring the presence of some noise in the observations), then the filter
$\mathbf{P}(X_n \in \cdot \mid Y_1, \ldots, Y_n)$ is also stable. This result represents a milestone in the theory
of nonlinear filtering, settling a long dispute in the field.

However, while this result holds in a very general setting and there is no explicit
mention of dimensionality, in practice it can only be applied to finite-dimensional
systems. In fact, on the one hand, if the underlying signal $(X_n)_{n\ge 0}$ has an
infinite-dimensional state space, then the ergodicity assumption in total variation cannot
be satisfied; on the other hand, if the observations $(Y_n)_{n\ge 0}$ are infinite-dimensional,
then the non-degeneracy condition cannot hold. In [  ] it has been shown that
the infinite dimensionality of the underlying signal is not a fundamental issue, and
that the main filter ergodicity result in [  ] still holds true, either upon working
with a local notion of convergence in total variation, or upon doing the analysis in
weak convergence, which embodies a form of locality in itself. However, this later
development still requires the same global non-degeneracy assumption as in [  ],
which essentially restricts the scope of the theory to finite-dimensional observations.

One of the contributions of this thesis is to develop the first results in filtering
with infinitely many observations, and to show that in this setting completely new
phenomena can appear. For instance, in Chapter 7 we show that we can have a
completely ergodic infinite-dimensional model $(X, Y)$, where the underlying system
$X$ is a collection of independent random variables and the structure of the observation
$Y$ is local, and still it is possible for the conditional distribution $\mathbf{P}(X \in \cdot \mid Y)$ to
display a phase transition in the signal-to-noise ratio (see Theorem 7.7 and Example
7.17). That is, as we condition on the observations a threshold appears
such that if the signal-to-noise ratio is below it, then the conditional distribution is
unique; otherwise, the conditional distribution is not unique. This example shows that while
the ergodicity of the underlying process can be localized so as to recover the powerful
general result as in [  7], localizing the non-degeneracy in the conditional law of the
observations does not help.

Far from being a theoretical point, the understanding of filtering theory in infinite
dimension is crucial for the development of particle filters that can work in practical
applications. In fact, it is well known that the qualitative understanding of
infinite-dimensional models is directly related to the quantitative understanding of
finite-dimensional models (see [50] and [39] for instance).


1.8  Outline  of  the  thesis 

This  thesis  consists  of  7  chapters  and  4  appendices. 

Chapter  1  is  the  introduction. 

Chapter 2 contains a collection of results that are used repeatedly throughout this
thesis. As a large portion of this thesis deals with controlling the distance between
conditional distributions in high dimension, we present a few elementary lemmas that
serve this purpose, along with the main tool that is used in our proofs, namely the Dobrushin
comparison theorem from statistical mechanics. We also give a brief overview of
Monte Carlo methods, as they are needed to describe the algorithms presented in
the next chapters. The goal of this chapter is to provide the tools that are
needed in the remainder of this thesis, along with establishing the notation that is
used throughout.

Chapter 3 provides an introduction to the classical theory of nonlinear filtering
and sequential Monte Carlo algorithms known as particle filters. Particle filters are
discussed in the light of the curse of dimensionality phenomenon. First, the sequential
importance sampling (SIS) algorithm is introduced, and it is shown how it suffers
from the curse of dimensionality with respect to time. This issue motivates the
introduction of the sequential importance resampling (SIR) algorithm, for which
time-uniform error bounds can be proved. The notion of filter stability plays a central role
in establishing bounds that do not depend on time. Nonetheless, it is shown that
both algorithms still suffer from the curse of dimensionality with respect to the spatial
dimension of the model. This discussion paves the way for the introduction of local
particle filters, which are developed in the next two chapters. The treatment of the
material presented in this chapter is inspired by [55] and [  ].

Chapter 4 introduces the block particle filter and shows how this algorithm
overcomes the curse of dimensionality by yielding error bounds that are uniform both
in time and in space. More generally, this chapter introduces the class of
high-dimensional filtering models that we consider in this thesis, and it illustrates how
the Dobrushin comparison theorem can be used to perform a local analysis in these
models. Emphasis is given to the decay of correlations property, which is seen to
be the key to establishing spatially-uniform error bounds, thus representing the
spatial counterpart of filter stability. It is shown how decay of correlations can be
exploited algorithmically by introducing a regularization step (marginalization over
non-overlapping blocks) to the basic formulation of the SIR algorithm. The analysis
of the block particle filter is instrumental to developing a general framework that
can encompass other algorithms (that is, other forms of regularization), such as the
one proposed in the next chapter. To facilitate the reading, the proofs of the results
presented in this chapter are included in Appendix A. This chapter is based on the
paper [40].

Chapter 5 introduces the localized Gibbs sampler particle filter, another local
algorithm that aims at exploiting the decay of correlations property of filtering models
through a form of regularization based on the notion of conditional independence
(rather than on the notion of independence, as for the block particle filter). While a
complete analysis of this algorithm is still missing, we prove a one-step error bound
that illustrates how this algorithm provides spatially homogeneous approximations of
the filter distribution, hence overcoming the main drawback of the block particle
filter (the proof is included in Appendix B). The analysis of this algorithm prompts
the investigation of the decay of correlations in general Markov chain Monte Carlo
methods, and new challenges arise in this context. The material presented in this
chapter is new and has not been submitted for publication yet.

Chapter  6  is  devoted  to  establishing  new  comparison  theorems  for  Gibbs  measures 
that  extend  the  applicability  of  the  original  Dobrushin  comparison  theorem  to  larger 
regions  of  the  phase  space.  The  proof  of  these  results  (contained  in  Appendix  C) 
is  part  of  a  more  general  framework  that  is  developed  to  analyze  the  convergence 
behavior  of  Gibbs  samplers,  a  particular  class  of  Markov  chains.  As  an  application, 
the  new  comparison  theorems  are  used  to  improve  qualitatively  the  analysis  of  the 
block  particle  filter  given  in  Chapter  4  to  handle  scenarios  where  ergodicity  in  space 
and in time are treated on a different footing. This chapter is based on the paper
[11].

Chapter 7 presents some of the first results in the theory of filtering with infinitely
many observations. The focus of this chapter is complementary to the quantitative
framework previously analyzed in this thesis, mostly in the realm of algorithms. Now
we are interested in the fundamentals of filtering theory in infinite dimension, and
filter stability and decay of correlations are analyzed qualitatively in models with
infinitely many degrees of freedom. We show that completely new phenomena appear
in this setting: contrary to the finite-dimensional case, inheritance of ergodicity can
undergo a phase transition in the signal-to-noise ratio. We refer to Appendix D for
the proofs of the results presented in this chapter. The material of this chapter is
taken from the paper [42], which further develops this set of ideas by yielding, for
instance, conditions that guarantee the inheritance of ergodicity.


Chapter 2
Preliminaries 


This chapter is devoted to introducing some elementary concepts and facts in
probability theory that will be needed in what follows. As this thesis ultimately deals
with conditioning, considerable attention is given to the notion of conditional expectations and
conditional distributions, along with some of their basic properties. Since we will
be mostly concerned with high-dimensional distributions, we present a collection of
tools to control their distances. Emphasis is also given to the Monte Carlo paradigm,
which is the backbone of the first part of this thesis. The material is presented in a
streamlined manner, and no attempt is made at developing a systematic treatment.
This chapter also serves to set the notation adopted in this thesis.

We assume that the reader is already familiar with measure-theoretic probability
theory at the level of an introductory class on the subject. We refer to [  ] for an
agile and beautiful introduction to this material, and to [  ] for a comprehensive and
systematic treatment of it.

2.1  Notation  and  conventions 

We begin by establishing some notation and conventions that will be used throughout.

A function from a measurable space $(E, \mathcal{E})$ to $\bar{\mathbb{R}} := [-\infty, +\infty]$, or a subset of it, is
$\mathcal{E}$-measurable if it is measurable relative to $\mathcal{E}$ and the Borel $\sigma$-algebra on $\bar{\mathbb{R}}$. We write
$f \in \mathcal{E}$ to mean that the function $f$ is $\mathcal{E}$-measurable. We write $1_A$ for the indicator
function of the event $A \in \mathcal{E}$. We say that a function is positive if it takes values in
$\bar{\mathbb{R}}_+ := [0, +\infty]$.

When we say that $X$ is an $(E, \mathcal{E})$-valued random variable with distribution $\mu$, we
mean that there is a probability space $(\Omega, \mathcal{H}, \mathbf{P})$ in the background so that $X$ is a
random variable taking values in the measurable space $(E, \mathcal{E})$ and

\[
\int_\Omega \mathbf{P}(d\omega)\, f(X(\omega)) = \mathbf{E} f(X) = \mu f = \int_E \mu(dx)\, f(x)
\]

for any positive $\mathcal{E}$-measurable function $f$. If $X$ has distribution $\mu$, we write $X \sim \mu$.
Given a random variable $X$, we denote by $\sigma X$ the $\sigma$-algebra generated by it.

To keep the use of parentheses to a minimum, we write $\mathbf{E}\, XY$ to mean $\mathbf{E}(XY)$.
To avoid pedantry, whenever easily inferred from the context, we will often not specify
the domain where functions are defined, the $\sigma$-algebras where events live, and the
$\sigma$-algebras involved in the definition of measurable functions. For instance, we will
often say that a probability measure $\mu$ on a measurable space $(E, \mathcal{E})$ is defined by
$\mu(dx)$, $\mu(A)$, or $\mu f$, without mentioning that this definition has to hold, respectively,
for each $x \in E$, each $A \in \mathcal{E}$, or for each positive $\mathcal{E}$-measurable function $f$. Often we
will also say that $\mu$ is a probability measure on $E$, when we really mean $(E, \mathcal{E})$.

2.2  Conditioning  and  Bayes  formula 

In this thesis we will be primarily interested in the behavior of conditional
expectations and conditional distributions. We presently recall some of the main definitions
and properties that will be needed in what follows. As a large part of this thesis
is devoted to Monte Carlo approximations, emphasis is given to the role of random
variables. As such, results in this section are mainly phrased in terms of random
variables, distributions of random variables, and $\sigma$-algebras generated by random variables,
rather than in terms of probability measures and generic $\sigma$-algebras.

Definition 2.1 (Conditional expectation). Let $X$ be an $\mathbb{R}$-valued random variable, and
let $Y$ be an $(F, \mathcal{F})$-valued random variable. The conditional expectation of $X$ given $Y$ is
any random variable of the form $h(Y)$, where $h$ is an $\mathbb{R}$-valued $\mathcal{F}$-measurable function,
such that the following holds for any positive $\mathcal{F}$-measurable function $f$:

\[
\mathbf{E}\, h(Y) f(Y) = \mathbf{E}\, X f(Y).
\]

We use the notation $\mathbf{E}(X|Y)$ to indicate any such random variable $h(Y)$. We also
write $\mathbf{E}(X|Y = y)$ to mean the value that any such function $h$ takes at $y \in F$, that
is, $\mathbf{E}(X|Y = y) = h(y)$ (recall that conditional expectations are defined up to almost
sure equivalence).
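
On a finite space the definition can be checked by direct enumeration. The following is a minimal sketch, assuming a hypothetical joint law for $(X, Y)$ (the table `joint` and the helper names are illustrative and do not come from the text): it computes $h(y) = \mathbf{E}(X|Y = y)$ and verifies the defining identity on indicator functions $f$.

```python
from fractions import Fraction as Fr

# Hypothetical joint law of (X, Y): X takes values in {0, 1, 2}, Y in {"a", "b"}.
joint = {
    (0, "a"): Fr(1, 8), (1, "a"): Fr(1, 4), (2, "a"): Fr(1, 8),
    (0, "b"): Fr(1, 4), (1, "b"): Fr(1, 8), (2, "b"): Fr(1, 8),
}

def cond_exp(joint):
    """Return h with h[y] = E(X | Y = y) for a finite joint law."""
    num, den = {}, {}
    for (x, y), p in joint.items():
        num[y] = num.get(y, 0) + x * p
        den[y] = den.get(y, 0) + p
    return {y: num[y] / den[y] for y in den}

def expect(joint, phi):
    """E phi(X, Y) under the joint law."""
    return sum(p * phi(x, y) for (x, y), p in joint.items())

h = cond_exp(joint)

# Defining identity E h(Y) f(Y) = E X f(Y), checked on indicator functions f.
for y0 in ("a", "b"):
    f = lambda y, y0=y0: 1 if y == y0 else 0
    lhs = expect(joint, lambda x, y: h[y] * f(y))
    rhs = expect(joint, lambda x, y: x * f(y))
    assert lhs == rhs
```

Checking indicators is enough here because every function on a two-point space is a linear combination of them. Exact rational arithmetic is used so that the identity holds with equality rather than up to rounding.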

The conditional expectation of $X$ given $Y$ is the function of $Y$ that provides the best
estimate of $X$ in the least squares sense, as the following lemma shows.

Lemma 2.2 (Optimality of conditional expectation). Let $X$ be an $\mathbb{R}$-valued random
variable, and let $Y$ be an $(F, \mathcal{F})$-valued random variable. Assume that $\mathbf{E} X^2 < \infty$.
Then, the function $y \in F \mapsto h(y) := \mathbf{E}(X|Y = y)$ satisfies

\[
h = \operatorname*{arg\,min}_{g} \mathbf{E}\,(X - g(Y))^2,
\]

where the minimization is with respect to all $\mathbb{R}$-valued measurable functions $g$ such
that $\mathbf{E}\, g(Y)^2 < \infty$.

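Proof. By the properties of conditional expectations we have

\[
\begin{aligned}
\mathbf{E}\,(X - h(Y))^2 &= \mathbf{E}\, X^2 + \mathbf{E}\, \mathbf{E}(X|Y)^2 - 2\, \mathbf{E}\, X\, \mathbf{E}(X|Y) \\
&= \mathbf{E}\, X^2 - \mathbf{E}\, \mathbf{E}(X|Y)^2 \le \mathbf{E}\, X^2 < \infty.
\end{aligned}
\]

It remains to prove that for any measurable function $g$ we have

\[
\mathbf{E}\,(X - h(Y))^2 \le \mathbf{E}\,(X - g(Y))^2.
\]

For simplicity, define $H := h(Y) = \mathbf{E}(X|Y)$ and $G := g(Y)$. Then,

\[
\begin{aligned}
\mathbf{E}\,(X - H)^2 &= \mathbf{E}\,(X - G + G - H)^2 \\
&= \mathbf{E}\,(X - G)^2 + \mathbf{E}\,(G - H)^2 + 2\, \mathbf{E}((X - G)(G - H)) \\
&= \mathbf{E}\,(X - G)^2 - \mathbf{E}\,(G - H)^2 \\
&\le \mathbf{E}\,(X - G)^2,
\end{aligned}
\]

where we used that, by the properties of conditional expectations,

\[
\begin{aligned}
\mathbf{E}\,(X - G)(G - H) &= \mathbf{E}\, \mathbf{E}((X - G)(G - H)|Y) = \mathbf{E}\, \mathbf{E}(X - g(Y)|Y)(G - H) \\
&= \mathbf{E}\,(H - G)(G - H) = -\mathbf{E}\,(G - H)^2.
\end{aligned}
\]

□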
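
The identity $\mathbf{E}(X - G)^2 = \mathbf{E}(X - H)^2 + \mathbf{E}(G - H)^2$ obtained in the proof can be verified numerically. The sketch below assumes a hypothetical finite joint law and a handful of arbitrary competitors $g$; neither is taken from the text.

```python
from fractions import Fraction as Fr

# Hypothetical joint law of (X, Y) on a finite space (an assumption for illustration).
joint = {(0, 0): Fr(1, 8), (1, 0): Fr(1, 4), (2, 0): Fr(1, 8),
         (0, 1): Fr(1, 4), (1, 1): Fr(1, 8), (2, 1): Fr(1, 8)}

def expect(phi):
    """E phi(X, Y) under the joint law."""
    return sum(p * phi(x, y) for (x, y), p in joint.items())

def h(y):
    """E(X | Y = y), computed directly from the joint law."""
    mass = sum(p for (x, yy), p in joint.items() if yy == y)
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / mass

def mse(g):
    """Mean squared error E (X - g(Y))^2."""
    return expect(lambda x, y: (x - g(y)) ** 2)

def gap(g):
    """E (g(Y) - h(Y))^2, the excess error of g over h."""
    return expect(lambda x, y: (g(y) - h(y)) ** 2)

# A few arbitrary competitors g.
candidates = [lambda y: Fr(1), lambda y: Fr(y), lambda y: Fr(3, 2) - y, lambda y: Fr(0)]
for g in candidates:
    # Identity from the proof: E(X - G)^2 = E(X - H)^2 + E(G - H)^2.
    assert mse(g) == mse(h) + gap(g)
    assert mse(h) <= mse(g)
```

The excess error of any competitor is exactly its squared distance from $h(Y)$, which is the sense in which the conditional expectation is an orthogonal projection.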


Let  us  recall  the  definition  of  transition  kernels,  which  is  instrumental  for  the 
definition  of  conditional  distributions  given  immediately  below. 

Definition 2.3 (Transition kernel). Let $(E, \mathcal{E})$ and $(F, \mathcal{F})$ be measurable spaces. Let
$K$ be a mapping from $E \times \mathcal{F}$ into $\bar{\mathbb{R}}_+$. Then, $K$ is called a transition kernel from
$(E, \mathcal{E})$ to $(F, \mathcal{F})$ if the following two conditions are satisfied:

(a) the mapping $x \mapsto K(x, B)$ is $\mathcal{E}$-measurable for every set $B \in \mathcal{F}$;

(b) the mapping $B \mapsto K(x, B)$ is a measure on $(F, \mathcal{F})$ for every $x \in E$.

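If $K$ is a transition kernel from $(E, \mathcal{E})$ to $(F, \mathcal{F})$, $\mu$ is a probability measure on
$(E, \mathcal{E})$ and $f$ is an $\mathcal{F}$-measurable function, we use the notation

\[
\begin{aligned}
Kf(x) &= K_x f = \int K(x, dy)\, f(y), \\
\mu K f &= \int \mu(dx)\, K(x, dy)\, f(y).
\end{aligned}
\]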
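
On finite spaces this notation reduces to matrix algebra: a kernel is a matrix whose rows are measures, functions are column vectors, and measures are row vectors. The following is a minimal sketch under that assumption; the particular numbers are hypothetical.

```python
from fractions import Fraction as Fr

# Hypothetical kernel K from E = {0, 1} to F = {0, 1, 2}: row x is the measure K(x, .).
K = [[Fr(1, 2), Fr(1, 4), Fr(1, 4)],
     [Fr(1, 3), Fr(1, 3), Fr(1, 3)]]
mu = [Fr(2, 5), Fr(3, 5)]          # probability measure on E
f = [Fr(1), Fr(0), Fr(4)]          # function on F

def K_f(K, f):
    """(Kf)(x) = sum_y K(x, y) f(y): the kernel acts on functions from the left."""
    return [sum(k * v for k, v in zip(row, f)) for row in K]

def mu_K(mu, K):
    """(mu K)(y) = sum_x mu(x) K(x, y): the kernel acts on measures from the right."""
    return [sum(m * row[y] for m, row in zip(mu, K)) for y in range(len(K[0]))]

def integrate(mu, f):
    """mu f = sum_x mu(x) f(x)."""
    return sum(m * v for m, v in zip(mu, f))

Kf = K_f(K, f)
muK = mu_K(mu, K)
# mu K f can be read as mu(Kf) or as (mu K) f; both orders give the same number.
assert integrate(mu, Kf) == integrate(muK, f)
```

The final assertion is the finite-space form of the fact that $\mu K f$ is unambiguous, which is why the notation omits parentheses.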

Definition 2.4 (Conditional distribution). Let $X$ be an $(E, \mathcal{E})$-valued random variable,
and let $Y$ be an $(F, \mathcal{F})$-valued random variable. A probability kernel $P : F \times \mathcal{E} \to [0, 1]$
which satisfies

\[
Pf(Y) = \mathbf{E}(f(X)|Y)
\]

for every positive $\mathcal{E}$-measurable function $f$ is called the conditional distribution of $X$
given $Y$. Sometimes we also write $\mathbf{P}(X \in dx|Y)$ to mean $P(Y, dx)$.

Remark 2.5 (Random measure). Given a measurable space $(E, \mathcal{E})$, a random measure
$\mu$ on $(E, \mathcal{E})$ is a transition kernel from the underlying probability space $(\Omega, \mathcal{H}, \mathbf{P})$
to $(E, \mathcal{E})$. We say that a collection of random variables $X_1, \ldots, X_N$ on $(E, \mathcal{E})$ is i.i.d.
coming from the random measure $\mu$ on $(E, \mathcal{E})$ if there exists a random variable $Y$
taking values in some measurable space $(F, \mathcal{F})$ such that the following holds true for
all positive $\mathcal{E}$-measurable functions $f_1, \ldots, f_N$:

\[
\mathbf{E}(f_1(X_1) \cdots f_N(X_N)|Y) = \mathbf{E}(f_1(X_1)|Y) \cdots \mathbf{E}(f_N(X_N)|Y) = \mu f_1(Y) \cdots \mu f_N(Y).
\]

Given two probability measures $\mu$ and $\nu$ on a measurable space $(E, \mathcal{E})$, recall the
following definitions. If for each $A \in \mathcal{E}$ such that $\nu(A) = 0$ we have $\mu(A) = 0$, then $\mu$
is said to be absolutely continuous with respect to $\nu$, and we write $\mu \ll \nu$. If $\mu \ll \nu$
and $\nu \ll \mu$, then $\mu$ and $\nu$ are said to be equivalent, and we write $\mu \sim \nu$. If there
exists $A \in \mathcal{E}$ such that $\mu(A) = 0$ and $\nu(A) = 1$, then $\mu$ and $\nu$ are said to be mutually
singular, and we write $\mu \perp \nu$.

The  following  is  a  key  result  that  relates  probability  measures  that  are  absolutely 
continuous. 


Theorem 2.6 (Radon-Nikodym derivative). Let $X$ and $Z$ be $(E, \mathcal{E})$-valued random
variables with distribution $\mu$ and $\nu$ respectively. Assume that $\mu \ll \nu$. Then, there
exists a positive $\mathcal{E}$-measurable function $\frac{d\mu}{d\nu}$ called the Radon-Nikodym derivative such
that

\[
\mathbf{E} f(X) = \mathbf{E}\, \frac{d\mu}{d\nu}(Z)\, f(Z)
\]

for each positive $\mathcal{E}$-measurable function $f$.

Proof. We refer to [65] for a proof of this result. □

The  following  is  a  key  result  that  relates  the  way  absolutely  continuous  probability 
measures  behave  under  conditioning.  It  is  one  of  the  many  forms  of  Bayes  formula. 


Theorem 2.7 (Bayes formula). Let $X$ and $Z$ be two $(E, \mathcal{E})$-valued random variables,
and let $Y$ be an $(F, \mathcal{F})$-valued random variable. Let $\mu$ be the distribution of $(X, Y)$, and
let $\nu$ be the distribution of $(Z, Y)$, with $\mu \ll \nu$. Then, for each positive $\mathcal{E}$-measurable
function $f$ we have

\[
\mathbf{E}(f(X)|Y) = \frac{\mathbf{E}\big(\frac{d\mu}{d\nu}(Z, Y)\, f(Z) \,\big|\, Y\big)}{\mathbf{E}\big(\frac{d\mu}{d\nu}(Z, Y) \,\big|\, Y\big)}.
\]

Let $Q$ be the conditional distribution of $Z$ given $Y$. Then, the conditional distribution
of $X$ given $Y$ is given by the probability kernel $P$ defined as

\[
P(y, dz) := \frac{Q(y, dz)\, \frac{d\mu}{d\nu}(z, y)}{\int Q(y, dz)\, \frac{d\mu}{d\nu}(z, y)}.
\]

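Proof. We only prove the statement for conditional expectations, as the statement
for conditional probabilities follows immediately by applying Definition 2.4. Fix a
positive $\mathcal{E}$-measurable function $f$. First, we prove that

\[
\mathbf{E}\big(\tfrac{d\mu}{d\nu}(Z, Y)\, f(Z) \,\big|\, Y\big) = \mathbf{E}(f(X)|Y)\, \mathbf{E}\big(\tfrac{d\mu}{d\nu}(Z, Y) \,\big|\, Y\big).
\]

As the right-hand side is clearly a function of $Y$, by definition of conditional
expectations we only need to prove that

\[
\mathbf{E}\, \mathbf{E}(f(X)|Y)\, \mathbf{E}\big(\tfrac{d\mu}{d\nu}(Z, Y) \,\big|\, Y\big)\, g(Y) = \mathbf{E}\, \tfrac{d\mu}{d\nu}(Z, Y)\, f(Z)\, g(Y)
\]

for every positive $\mathcal{F}$-measurable function $g$. In fact, using the properties of
conditional expectations and using the Radon-Nikodym theorem (Theorem 2.6) for the
random variables $(X, Y)$ and $(Z, Y)$ we have

\[
\begin{aligned}
\mathbf{E}\, \mathbf{E}(f(X)|Y)\, \mathbf{E}\big(\tfrac{d\mu}{d\nu}(Z, Y) \,\big|\, Y\big)\, g(Y)
&= \mathbf{E}\, \mathbf{E}(f(X)|Y)\, \tfrac{d\mu}{d\nu}(Z, Y)\, g(Y) \\
&= \mathbf{E}\, \mathbf{E}(f(X)|Y)\, g(Y) \\
&= \mathbf{E}\, f(X)\, g(Y) \\
&= \mathbf{E}\, \tfrac{d\mu}{d\nu}(Z, Y)\, f(Z)\, g(Y).
\end{aligned}
\]

To conclude the proof we only need to prove that

\[
\mathbf{E}\big(\tfrac{d\mu}{d\nu}(Z, Y) \,\big|\, Y\big) > 0 \quad \mathbf{P}\text{-a.s.}
\]

Using again the Radon-Nikodym theorem we get

\[
\begin{aligned}
\mathbf{E}\, 1_{\mathbf{E}(\frac{d\mu}{d\nu}(Z, Y)|Y) = 0}(Y)
&= \mathbf{E}\, \tfrac{d\mu}{d\nu}(Z, Y)\, 1_{\mathbf{E}(\frac{d\mu}{d\nu}(Z, Y)|Y) = 0}(Y) \\
&= \mathbf{E}\, \mathbf{E}\big(\tfrac{d\mu}{d\nu}(Z, Y) \,\big|\, Y\big)\, 1_{\mathbf{E}(\frac{d\mu}{d\nu}(Z, Y)|Y) = 0}(Y) = 0.
\end{aligned}
\]

□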


As  an  immediate  application  of  Bayes  formula  we  have  the  following  lemma  on  the 
computation  of  conditional  distributions.  While  this  lemma  could  be  proved  using 
directly  the  definition  of  conditional  expectation,  we  prove  it  using  Bayes  formula,  as 
it  is  representative  of  the  way  Bayes  formula  will  often  be  used  in  this  thesis. 

Lemma 2.8 (Computation of conditional distributions). Let $X$ be an $(E, \mathcal{E})$-valued
random variable, and let $Y$ be an $(F, \mathcal{F})$-valued random variable such that for each
positive $(E \times F, \mathcal{E} \otimes \mathcal{F})$-measurable function $f$ we have

\[
\mathbf{E} f(X, Y) = \int \rho(dx)\, \lambda(dy)\, \gamma(x, y)\, f(x, y),
\]

where $\rho$ and $\lambda$ are probability measures on $(E, \mathcal{E})$ and $(F, \mathcal{F})$ respectively, and $\gamma$ is a
strictly positive $\mathcal{E} \otimes \mathcal{F}$-measurable function. Then, the conditional distribution of $X$
given $Y$ is given by the probability kernel $P$ defined as

\[
P(y, dx) := \frac{\rho(dx)\, \gamma(x, y)}{\int \rho(dx)\, \gamma(x, y)}.
\]

Proof. Define the following two probability measures on $(E \times F, \mathcal{E} \otimes \mathcal{F})$:

\[
\begin{aligned}
\mu(dx, dy) &:= \rho(dx)\, \lambda(dy)\, \gamma(x, y), \\
\nu(dx, dy) &:= \rho(dx)\, \lambda(dy).
\end{aligned}
\]

Clearly $\mu \ll \nu$ and $\frac{d\mu}{d\nu} = \gamma$. By definition $\mu$ is the distribution of $(X, Y)$. Let $Z$
be an $(E, \mathcal{E})$-valued random variable such that $\nu$ is the distribution of $(Z, Y)$. By
independence we immediately find that the conditional distribution of $Z$ given $Y$ is
given by the probability kernel $Q$ defined as

\[
Q(y, dz) = \rho(dz).
\]

Then, by Bayes formula (Theorem 2.7) the result follows immediately. □
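
The reweighting formula of Lemma 2.8 is easy to test on a finite space, where the kernel is obtained by multiplying the reference measure by the weight and normalising. The sketch below assumes hypothetical choices of the reference measures and of the weight (named `rho`, `lam` and `gamma` in the code) and compares the kernel with the conditional law computed directly from the joint law.

```python
from fractions import Fraction as Fr

# Hypothetical finite model: rho on E = {0, 1, 2}, lam on F = {0, 1},
# and a strictly positive weight gamma(x, y).
rho = {0: Fr(1, 2), 1: Fr(1, 4), 2: Fr(1, 4)}
lam = {0: Fr(1, 3), 1: Fr(2, 3)}
raw = {(0, 0): 1, (1, 0): 2, (2, 0): 3, (0, 1): 3, (1, 1): 1, (2, 1): 1}
Z = sum(rho[x] * lam[y] * raw[x, y] for x in rho for y in lam)
gamma = {xy: Fr(w) / Z for xy, w in raw.items()}   # normalised so the joint law sums to 1

# Joint law of (X, Y) as in the statement of the lemma.
joint = {(x, y): rho[x] * lam[y] * gamma[x, y] for x in rho for y in lam}

def kernel(y):
    """P(y, .) from the lemma: reweight rho by gamma(., y) and normalise."""
    norm = sum(rho[x] * gamma[x, y] for x in rho)
    return {x: rho[x] * gamma[x, y] / norm for x in rho}

def direct(y):
    """P(X = x | Y = y) computed from the joint law."""
    p_y = sum(joint[x, y] for x in rho)
    return {x: joint[x, y] / p_y for x in rho}

for y in lam:
    assert kernel(y) == direct(y)
```

Note that the kernel does not depend on the reference measure on the observation space, nor on the normalising constant of the weight: both cancel in the ratio.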


2.3 Distances between probability measures


In this thesis we will face the problem of measuring and controlling the distance
between probability measures. To this end, we now introduce the two main
notions of distance that we will consider, along with some elementary lemmas on
their behavior.

Let $(E, \mathcal{E})$ be a measurable space, and let $\mu$ and $\nu$ be two (possibly random)
probability measures on it. We define the total variation distance between $\mu$ and $\nu$
as

\[
\|\mu - \nu\| := \sup_{f \in \mathcal{E} : \|f\|_\infty \le 1} |\mu f - \nu f|,
\]

where $\|f\|_\infty := \sup_{x \in E} |f(x)|$. We will also need the following distance between
probability measures:

\[
|\!|\!|\mu - \nu|\!|\!| := \sup_{f \in \mathcal{E} : \|f\|_\infty \le 1} \sqrt{\mathbf{E}\,(\mu f - \nu f)^2}.
\]

It is easy to verify that $\|\cdot\|$ and $|\!|\!|\cdot|\!|\!|$ define two metrics on the space of probability
measures. Both metrics yield numbers between 0 (if $\mu = \nu$) and 2 (if $\mu \perp \nu$). In
fact, if $\mu \perp \nu$ then there exists $A \in \mathcal{E}$ such that $\mu(A) = 0$ and $\nu(A) = 1$, and choosing
$f = 1_A - 1_{A^c}$, where $A^c$ is the complement of $A$, we have $|\mu f - \nu f| = 2$. Note that if
$\mu$ and $\nu$ are not random, then we clearly have $|\!|\!|\mu - \nu|\!|\!| = \|\mu - \nu\|$.
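
For deterministic measures on a finite space, the supremum in the definition of the total variation distance is attained at the sign of the difference of the two measures, so the distance is the sum of the absolute differences of the point masses. The following minimal sketch (with hypothetical measures) checks this against a brute-force maximisation over test functions with values in $\{-1, +1\}$, which are the extreme points of the unit ball.

```python
from fractions import Fraction as Fr
from itertools import product

def tv(mu, nu):
    """||mu - nu|| = sup over |f| <= 1 of |mu f - nu f| on a finite space.

    The supremum is attained at f = sign(mu - nu), giving sum_x |mu(x) - nu(x)|.
    """
    return sum(abs(m - n) for m, n in zip(mu, nu))

def tv_brute_force(mu, nu):
    """Maximise |mu f - nu f| over all f with values in {-1, +1}."""
    best = 0
    for f in product((-1, 1), repeat=len(mu)):
        best = max(best, abs(sum((m - n) * v for m, n, v in zip(mu, nu, f))))
    return best

mu = [Fr(1, 2), Fr(1, 2), Fr(0)]
nu = [Fr(1, 4), Fr(1, 4), Fr(1, 2)]
singular = [Fr(0), Fr(0), Fr(1)]    # supported where mu has no mass

assert tv(mu, nu) == tv_brute_force(mu, nu) == 1
assert tv(mu, mu) == 0
assert tv(mu, singular) == 2        # mutually singular measures are at distance 2
```

With this normalisation the distance is twice the largest difference in the probability of a single event, which explains the range between 0 and 2.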

We now present some results in the general setting of (possibly random)
probability measures. These results hold both with respect to the metric $|\!|\!|\cdot|\!|\!|$ and with
respect to the metric $\|\cdot\|$ (in the latter case, if the probability measures are random
then these bounds hold for each realization of the randomness).

As  we  will  be  interested  in  conditional  distributions,  we  need  to  understand  how 
conditioning  affects  the  distance  between  measures.  Since  conditioning  introduces 
weights  on  measures  (see  Bayes  formula,  Theorem  2.7),  we  will  need  the  following 
lemma. 

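Lemma 2.9 (Weighted measures). Let $\mu$ and $\nu$ be (possibly random) probability
measures on a measurable space $(E, \mathcal{E})$, and let $g$ be a real-valued $\mathcal{E}$-measurable
function which is bounded away from zero and infinity, that is, $\inf_{x \in E} g(x) > 0$ and
$\sup_{x \in E} g(x) < \infty$. Define

\[
\mu_g(A) := \frac{\int \mu(dx)\, g(x)\, 1_A(x)}{\int \mu(dx)\, g(x)}, \qquad
\nu_g(A) := \frac{\int \nu(dx)\, g(x)\, 1_A(x)}{\int \nu(dx)\, g(x)}.
\]

Then,

\[
|\!|\!|\mu_g - \nu_g|\!|\!| \le 2\, \frac{\sup_{x \in E} g(x)}{\inf_{x \in E} g(x)}\, |\!|\!|\mu - \nu|\!|\!|.
\]

The same conclusion holds if the $|\!|\!|\cdot|\!|\!|$-norm is replaced by the $\|\cdot\|$-norm.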
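
Proof. For any real-valued measurable function $f$ we have

\[
\begin{aligned}
\frac{\mu(gf)}{\mu g} - \frac{\nu(gf)}{\nu g}
&= \frac{\mu(gf) - \nu(gf)}{\mu g} + \frac{\nu(gf)}{\mu g} - \frac{\nu(gf)}{\nu g} \\
&= \frac{\|g\|_\infty}{\mu g} \Big\{ \mu\Big(\frac{gf}{\|g\|_\infty}\Big) - \nu\Big(\frac{gf}{\|g\|_\infty}\Big) \Big\}
+ \frac{\nu(gf)}{\nu g}\, \frac{\|g\|_\infty}{\mu g} \Big\{ \nu\Big(\frac{g}{\|g\|_\infty}\Big) - \mu\Big(\frac{g}{\|g\|_\infty}\Big) \Big\}.
\end{aligned}
\]

If we assume that $\|f\|_\infty \le 1$, then we have $\big\| \frac{fg}{\|g\|_\infty} \big\|_\infty \le 1$ and $|\nu(fg)| \le \nu(g)$, as $g$ is
positive. As $\big\| \frac{g}{\|g\|_\infty} \big\|_\infty \le 1$ and $\mu g \ge \inf_x g(x)$, the proof follows immediately by using
the triangle inequality for the metric $|\!|\!|\cdot|\!|\!|$ or for the metric $\|\cdot\|$. □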
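
As a sanity check, the bound of Lemma 2.9 can be evaluated on a small example. The sketch below assumes deterministic measures on a three-point space (so that the two norms coincide) and a hypothetical weight $g$; it illustrates the statement and is not a substitute for the proof.

```python
from fractions import Fraction as Fr

def tv(mu, nu):
    """Total variation distance on a finite space (values in [0, 2])."""
    return sum(abs(m - n) for m, n in zip(mu, nu))

def reweight(mu, g):
    """mu_g(A) = (integral of g 1_A d mu) / (integral of g d mu) on a finite space."""
    total = sum(m * w for m, w in zip(mu, g))
    return [m * w / total for m, w in zip(mu, g)]

mu = [Fr(1, 2), Fr(1, 4), Fr(1, 4)]
nu = [Fr(1, 3), Fr(1, 3), Fr(1, 3)]
g = [Fr(1), Fr(2), Fr(4)]     # bounded away from zero and infinity

lhs = tv(reweight(mu, g), reweight(nu, g))
rhs = 2 * max(g) / min(g) * tv(mu, nu)
assert lhs <= rhs
```

In this example the bound is far from tight, which is expected: the constant in the lemma is a worst case over all pairs of measures and all weights with the same ratio between supremum and infimum.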

A collection of random variables $(X_n)_{n\ge 0}$ taking values in a measurable space
$(E, \mathcal{E})$ is a Markov chain if there exists a transition kernel $P$ from $(E, \mathcal{E})$ to $(E, \mathcal{E})$
such that for each $n \ge 1$ and each $A \in \mathcal{E}$ we have

\[
\mathbf{P}(X_n \in A \mid X_0, \ldots, X_{n-1}) = P(X_{n-1}, A).
\]

In this thesis we will be mostly interested in stochastic systems that can be described
as Markov chains. Hence, we need to understand how the Markovian dynamics affects
the distance between probability measures. The so-called minorization condition
represents a strong condition that causes Markov chains to forget their initial condition
at a geometric rate, as the following lemma shows.

Lemma 2.10 (Minorization condition for Markov chains). Let $\mu$ and $\nu$ be (possibly random) probability measures on a measurable space $(E, \mathcal{E})$ and let $P$ be a transition kernel from $(E, \mathcal{E})$ to $(E, \mathcal{E})$. Then,

$$|||\mu P - \nu P||| \le |||\mu - \nu|||.$$

If there exist a probability measure $\rho$ on $(E, \mathcal{E})$ and $\varepsilon > 0$ such that $P$ satisfies the following minorization condition

$$P(x, A) \ge \varepsilon\,\rho(A) \quad \text{for each } x \in E,\ A \in \mathcal{E},$$

then

$$|||\mu P - \nu P||| \le (1-\varepsilon)\, |||\mu - \nu|||.$$

The same conclusions hold if the $|||\cdot|||$-norm is replaced by the $\|\cdot\|$-norm.

Proof. The conditions $f \in \mathcal{E}$, $\|f\|_\infty \le 1$ clearly imply $Pf \in \mathcal{E}$, $\|Pf\|_\infty \le 1$. The first statement of the lemma follows immediately:

$$|||\mu P - \nu P||| = \sup_{f\in\mathcal{E}:\,\|f\|_\infty\le 1} \sqrt{\mathbf{E}(\mu P f - \nu P f)^2} \le |||\mu - \nu|||.$$

To prove the second statement, define

$$K(x, A) := \frac{P(x, A) - \varepsilon\,\rho(A)}{1-\varepsilon}$$

for each $x \in E$, $A \in \mathcal{E}$. By the minorization condition it is easy to verify that $K$ is a transition kernel. As

$$\mu P - \nu P = (1-\varepsilon)(\mu K - \nu K),$$

proceeding as above we get

$$|||\mu P - \nu P||| = (1-\varepsilon) \sup_{f\in\mathcal{E}:\,\|f\|_\infty\le 1} \sqrt{\mathbf{E}(\mu K f - \nu K f)^2} \le (1-\varepsilon)\, |||\mu - \nu|||.$$

The same argument holds with the $\|\cdot\|$-norm. □

Under the minorization condition the map $\mu \mapsto \mu P$ is a strict contraction in the $|||\cdot|||$-norm. This implies that the Markov chain is geometrically ergodic: the distance between the laws of the Markov chain started from two different initial measures decays geometrically in time, namely,

$$|||\mu P^n - \nu P^n||| \le (1-\varepsilon)^n\, |||\mu - \nu|||.$$
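The geometric contraction can be observed directly on a finite state space with deterministic initial measures. In the Python sketch below, the $3 \times 3$ transition matrix and the helper names are hypothetical choices made for illustration. On a finite space the largest admissible minorization constant is $\varepsilon = \sum_y \min_x P(x, y)$, attained with $\rho(y) \propto \min_x P(x, y)$.

```python
def tv(mu, nu):
    # ||mu - nu|| = sum_x |mu(x) - nu(x)| on a finite space.
    return sum(abs(a - b) for a, b in zip(mu, nu))

def step(mu, P):
    # (mu P)(y) = sum_x mu(x) P(x, y)
    k = len(P)
    return [sum(mu[x] * P[x][y] for x in range(k)) for y in range(k)]

def minorization_constant(P):
    # Largest eps such that P(x, .) >= eps * rho(.) for some probability rho.
    k = len(P)
    return sum(min(P[x][y] for x in range(k)) for y in range(k))

def contraction_holds(mu, nu, P, n_steps=10, tol=1e-12):
    # Check ||mu P^n - nu P^n|| <= (1 - eps)^n ||mu - nu|| for n = 1..n_steps.
    eps = minorization_constant(P)
    d0 = tv(mu, nu)
    for n in range(1, n_steps + 1):
        mu, nu = step(mu, P), step(nu, P)
        if tv(mu, nu) > (1.0 - eps) ** n * d0 + tol:
            return False
    return True

# Hypothetical transition matrix; each row sums to one.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]

print(minorization_constant(P), contraction_holds([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], P))
```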

2.4  Distances  between  probability  measures  in 
high  dimension 

In this thesis we will be interested in the behavior of probability measures in high (possibly infinite) dimension. The canonical description of a high-dimensional random system is provided by specifying a probability measure $\rho$ on a (possibly infinite) product space $E = \prod_{i\in I} E^i$: each site $i \in I$ represents a single degree of freedom, or dimension, of the model. When $I$ is defined as the set of vertices of a graph, the measure $\rho$ defines a graphical model or a random field. Models of this type are ubiquitous in statistical mechanics, combinatorics, computer science, statistics, and in many other areas of science and engineering.

Let $\rho$ and $\tilde\rho$ be two such models that are defined on the same space $E$. We ask the following basic question: when is $\tilde\rho$ a good approximation of $\rho$? As briefly seen in the previous section, probability theory provides numerous methods to evaluate the difference between arbitrary probability measures. However, the high-dimensional setting brings some specific challenges: any approximation of practical utility in high dimension must yield error bounds that do not grow, or at least grow sufficiently slowly, in the model dimension $d = \operatorname{card} I$. We therefore seek quantitative methods that allow us to establish dimension-free bounds on high-dimensional probability distributions.

The Dobrushin comparison theorem that we are about to introduce is a powerful (albeit blunt) tool to bound the total variation distance between marginals of high-dimensional probability measures $\rho$ and $\tilde\rho$ in terms of their local conditional distributions. This method was developed by Dobrushin in [18, Theorem 3] in the context of statistical mechanics. Presently we introduce this tool in its simplified form, which is due to Föllmer [ ] and has become standard textbook material, cf. [ , Theorem 8.20] and [15, Theorem V.2.2].¹ Despite the crucial importance that this theorem has for the results that will be developed in this thesis, we refer to Chapter 6 for its proof (see Section 6.3.1 in particular). In fact, one of the goals of Chapter 6 is precisely to develop a more general version of this comparison theorem.

¹ Note that our definition of $\|\cdot\|_J$ differs by a factor 2 from that in [27].

Define the coordinate projections $X^i : x \mapsto x^i$ for $x \in E$ and $i \in I$. For any probability $\rho$ on $E$, we fix a version $\rho^i_x$ of the regular conditional probability

$$\rho^i_x(A) := \rho(X^i \in A \mid X^{I\setminus\{i\}} = x^{I\setminus\{i\}}).$$

We also define for $J \subseteq I$ the local total variation distance

$$\|\rho - \rho'\|_J := \sup_{f\in S^J:\,|f|\le 1} |\rho(f) - \rho'(f)|,$$

where $S^J$ is the class of measurable functions $f : E \to \mathbb{R}$ such that $f(x) = f(z)$ whenever $x^J = z^J$. For $J = I$, we write $\|\rho - \rho'\|$ for simplicity.

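Theorem 2.11 (Dobrushin comparison theorem). Let $\rho$, $\tilde\rho$ be probability measures on $E$. Define

$$C_{ij} = \frac{1}{2} \sup_{x,z\in E:\ x^{I\setminus\{j\}} = z^{I\setminus\{j\}}} \|\rho^i_x - \rho^i_z\|, \qquad b_j = \sup_{x\in E} \|\rho^j_x - \tilde\rho^j_x\|.$$

Suppose that the Dobrushin condition holds:

$$\max_{i\in I} \sum_{j\in I} C_{ij} < 1.$$

Then the matrix sum $D := \sum_{n\ge 0} C^n$ is convergent, and we have for every $J \subseteq I$

$$\|\rho - \tilde\rho\|_J \le \sum_{i\in J} \sum_{j\in I} D_{ij}\, b_j.$$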

The Dobrushin comparison theorem can be informally interpreted as follows. The quantity $C_{ij}$ measures the degree to which a perturbation of site $j$ directly affects site $i$ under the distribution $\rho$. However, perturbing site $j$ might also indirectly affect $i$: it could affect another site $k$ which in turn affects $i$, etc. The aggregate effect of a perturbation of site $j$ on site $i$ is captured by the quantity $D_{ij}$. If $D_{ij}$ decays exponentially in the distance $d(i,j)$ (which is a useful manifestation of the decay of correlations property that we will often encounter in this thesis), then Theorem 2.11 yields, for example, $\|\rho - \tilde\rho\|_i \lesssim \sum_j e^{-d(i,j)}\, b_j$, where $b_j$ measures the local error at site $j$ between $\rho$ and $\tilde\rho$ (in terms of the conditional distributions $\rho^j_\cdot$ and $\tilde\rho^j_\cdot$).

In many applications it is natural to describe high-dimensional probability distributions in terms of local conditional probabilities of the form $\rho^i_x$. This is in essence a static picture, where we describe the behavior of each coordinate $i$ given that the configuration of the remaining sites $I \setminus \{i\}$ is frozen. In models that possess dynamics, this description is not very natural. In this setting, each site $i \in I$ occurs at a given time $\tau(i)$, and its state is only determined by the configuration of sites $j \in I$ in the past and present $\tau(j) \le \tau(i)$, but not by the future. It is therefore interesting to note that the original comparison theorem of Dobrushin [18] is actually more general than Theorem 2.11 in that it is applicable both in the static and dynamic settings. We presently state the one-sided counterpart to Theorem 2.11, and we refer to Chapter 6 for a more general version of this result and for its proof (see Section 6.3.3).

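Assume that we are given a function $\tau : I \to \mathbb{Z}$ that assigns to each site $i \in I$ an integer index $\tau(i)$. Define, for each $t \in \mathbb{Z}$,

$$I_{\le t} := \{j \in I : \tau(j) \le t\}.$$

For any probability $\rho$ on $E$, we fix a version $\gamma^i_x$ of the regular conditional probability

$$\gamma^i_x(A) := \rho(X^i \in A \mid X^{I_{\le\tau(i)}\setminus\{i\}} = x^{I_{\le\tau(i)}\setminus\{i\}}).$$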


We  can  now  state  the  one-sided  Dobrushin  comparison  theorem. 

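Theorem 2.12 (One-sided Dobrushin comparison theorem). Let $\rho$, $\tilde\rho$ be probability measures on $E$. Define

$$C_{ij} = \frac{1}{2} \sup_{x,z\in E:\ x^{I\setminus\{j\}} = z^{I\setminus\{j\}}} \|\gamma^i_x - \gamma^i_z\|, \qquad b_j = \sup_{x\in E} \|\gamma^j_x - \tilde\gamma^j_x\|.$$

Suppose that the Dobrushin condition holds:

$$\max_{i\in I} \sum_{j\in I} C_{ij} < 1.$$

Then the matrix sum $D := \sum_{n\ge 0} C^n$ is convergent, and we have for every $J \subseteq I$

$$\|\rho - \tilde\rho\|_J \le \sum_{i\in J} \sum_{j\in I} D_{ij}\, b_j.$$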

Note that the one-sided comparison theorem can be interpreted as a generalization of Theorem 2.11 (just take $\tau$ to be a constant function). However, we stated two different theorems to stress the difference between the static and dynamic case. Both theorems will play a crucial role in this thesis.

In order to use these comparison theorems we must be able to bound the quantities $C_{ij}$ and $b_j$. The elementary lemmas introduced in Section 2.3 will be used precisely for this purpose. We presently introduce a lemma that will be essential for bounding the matrix $D$ coming from the comparison theorems. This result states that if $C_{ij}$ decays exponentially in the distance between $i$ and $j$ at a sufficiently rapid rate, then $D_{ij}$ will also decay exponentially in the distance between $i$ and $j$. It is essentially a simple lemma about matrices.

Lemma 2.13. Let $I$ be a finite set and let $m$ be a pseudometric on $I$. Let $C = (C_{ij})_{i,j\in I}$ be a matrix with nonnegative entries. Suppose that

$$\max_{i\in I} \sum_{j\in I} e^{m(i,j)}\, C_{ij} \le c < 1.$$

Then the matrix $D = \sum_{n\ge 0} C^n$ satisfies

$$\max_{i\in I} \sum_{j\in I} e^{m(i,j)}\, D_{ij} \le \frac{1}{1-c}.$$

In particular, this implies that

$$\sum_{j\in J} D_{ij} \le \frac{e^{-m(i,J)}}{1-c}$$

for every $J \subseteq I$.

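Proof. Define for any matrix $A$ with nonnegative entries the norm

$$\|A\|_m := \max_{i\in I} \sum_{j\in I} e^{m(i,j)}\, A_{ij}.$$

Using $m(i,j) \le m(i,k) + m(k,j)$, we compute

$$\|AB\|_m = \max_{i\in I} \sum_{j\in I} \sum_{k\in I} e^{m(i,j)}\, A_{ik} B_{kj} \le \max_{i\in I} \sum_{k\in I} e^{m(i,k)} A_{ik} \sum_{j\in I} e^{m(k,j)} B_{kj} \le \|A\|_m\, \|B\|_m,$$

so $\|\cdot\|_m$ is a matrix norm. Therefore,

$$\|D\|_m \le \sum_{n\ge 0} \|C\|_m^n \le \sum_{n\ge 0} c^n = \frac{1}{1-c}.$$

As

$$e^{m(i,J)} \sum_{j\in J} A_{ij} \le \sum_{j\in J} e^{m(i,j)} A_{ij} \le \|A\|_m$$

for any such matrix $A$, the last statement of the lemma follows immediately by taking $A = D$. □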
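Lemma 2.13 is easy to probe numerically. The Python sketch below uses a hypothetical example: sites $\{0, \ldots, 11\}$ on a line, the metric $m(i,j) = |i-j|$, and the matrix $C_{ij} = 0.1\, e^{-2|i-j|}$. It reads $m(i,J)$ as $\min_{j\in J} m(i,j)$, which is an assumption consistent with the last display of the proof, and it approximates $D$ by a truncated Neumann series, so the computed entries are lower bounds on the true ones.

```python
import math

def weighted_norm(A, m):
    # ||A||_m = max_i sum_j exp(m(i, j)) A_ij
    n = len(A)
    return max(sum(math.exp(m(i, j)) * A[i][j] for j in range(n)) for i in range(n))

def neumann_sum(C, terms=200):
    # Truncated series D ~ sum_{n=0}^{terms} C^n (entrywise below the full sum).
    n = len(C)
    D = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    power = [row[:] for row in D]
    for _ in range(terms):
        power = [[sum(power[i][k] * C[k][j] for k in range(n)) for j in range(n)]
                 for i in range(n)]
        D = [[D[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return D

def m(i, j):
    # Hypothetical metric: graph distance on a line of sites.
    return float(abs(i - j))

n = 12
C = [[0.1 * math.exp(-2.0 * abs(i - j)) for j in range(n)] for i in range(n)]
c = weighted_norm(C, m)   # plays the role of the constant c in the lemma
D = neumann_sum(C)

far_sites = range(8, n)   # J = {8, ..., 11}, so m(0, J) = 8
tail = sum(D[0][j] for j in far_sites)
print(c, weighted_norm(D, m), 1.0 / (1.0 - c), tail, math.exp(-8.0) / (1.0 - c))
```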

In  the  remainder  of  this  section  we  present  two  simple  results  that  are  meant  to 
illustrate  the  models  that  we  will  consider  in  this  thesis. 

Often we will state the general fact that “probability measures tend to be singular in high (or infinite) dimension.” The following proposition exhibits a concrete manifestation of this general fact.

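Proposition 2.14. Let $(E, \mathcal{E})$ be a measurable space and let $\mu$ and $\nu$ be two probability measures on it. Define the following product measures on $(E^{\mathbb{N}}, \mathcal{E}^{\mathbb{N}})$:

$$\mu^{\otimes} := \bigotimes_{n\in\mathbb{N}} \mu, \qquad \nu^{\otimes} := \bigotimes_{n\in\mathbb{N}} \nu.$$

If $\mu \ne \nu$ then $\mu^{\otimes} \perp \nu^{\otimes}$.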
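Proof. Let $(\Omega, \mathcal{F}, \mathbf{P})$ be a probability space, and let $(X_n)_{n\in\mathbb{N}}$ and $(Y_n)_{n\in\mathbb{N}}$ be two collections of i.i.d. random variables taking values in $(E, \mathcal{E})$, such that $X_n \sim \mu$ and $Y_n \sim \nu$ for each $n \in \mathbb{N}$. As $\mu \ne \nu$, there exists $A \in \mathcal{E}$ such that $\mu(A) \ne \nu(A)$. Define

$$B := \Big\{ z = (z_1, z_2, \ldots) \in E^{\mathbb{N}} : \lim_{N\to\infty} \frac{1}{N} \sum_{n=1}^N 1_A(z_n) = \mu(A) \Big\}.$$

By the law of large numbers we have

$$\mu^{\otimes}(B) = \mathbf{P}\Big(\Big\{\omega \in \Omega : \lim_{N\to\infty} \frac{1}{N} \sum_{n=1}^N 1_A(X_n(\omega)) = \mu(A)\Big\}\Big) = 1,$$

$$\nu^{\otimes}(B) = \mathbf{P}\Big(\Big\{\omega \in \Omega : \lim_{N\to\infty} \frac{1}{N} \sum_{n=1}^N 1_A(Y_n(\omega)) = \mu(A)\Big\}\Big) = 0.$$

□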

Proposition 2.14 attests that unless two measures $\mu$ and $\nu$ are the same, their infinite products $\mu^{\otimes}$ and $\nu^{\otimes}$ are mutually singular. This example illustrates the fundamental reason why a global analysis of high-dimensional models is not suitable to properly describe these models. On the other hand, these models can be meaningfully interpreted by looking at local quantities. To see this, consider the case $\|\mu - \nu\| = \varepsilon \ll 1$. Then, for $J \subseteq I$ a telescoping argument easily gives $\|\mu^{\otimes} - \nu^{\otimes}\|_J \le \varepsilon \operatorname{card} J$, whereas for the global total variation distance we have $\|\mu^{\otimes} - \nu^{\otimes}\| = 2$. By bounding the local total variation distance over subsets of coordinates in terms of local quantities (the conditional distributions of each coordinate given all the others), the Dobrushin comparison theorem represents the key tool that will be used in this thesis to perform a local analysis in high-dimensional models.
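The contrast between local and global distances can be made concrete for products of Bernoulli measures, where the distance over $k$ coordinates can be computed exactly because the likelihood ratio depends only on the number of ones. The Python sketch below is an illustration under assumed parameters ($\mu = \mathrm{Bernoulli}(0.5)$, $\nu = \mathrm{Bernoulli}(0.55)$, so $\varepsilon = 0.1$); the function names are hypothetical.

```python
import math

def log_binom_pmf(s, k, p):
    # log of C(k, s) p^s (1 - p)^(k - s), computed in log space to avoid overflow.
    return (math.lgamma(k + 1) - math.lgamma(s + 1) - math.lgamma(k - s + 1)
            + s * math.log(p) + (k - s) * math.log(1.0 - p))

def local_tv(k, p, q):
    # ||mu^k - nu^k|| over k coordinates for mu = Bernoulli(p), nu = Bernoulli(q).
    # All sequences with s ones have the same probability under each product,
    # so the sum over sequences collapses to a sum over s.
    return sum(abs(math.exp(log_binom_pmf(s, k, p)) - math.exp(log_binom_pmf(s, k, q)))
               for s in range(k + 1))

p, q = 0.5, 0.55
eps = 2.0 * abs(p - q)  # ||mu - nu|| in the sup_{|f|<=1} convention
for k in (1, 10, 100, 1000, 5000):
    print(k, local_tv(k, p, q), min(2.0, eps * k))
```

For small $k$ the distance stays below $\varepsilon \operatorname{card} J$, while for large $k$ it approaches the maximal value $2$, in line with the singularity of the infinite products.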

While infinite-dimensional probability measures can be equivalent, they can differ significantly only on a finite number of coordinates, as the following example taken from [14, Chapter 9] illustrates.

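Example 2.15. For each $a \in \mathbb{R}$, let $\lambda_a$ be the distribution of a Gaussian random variable in $\mathbb{R}$ with mean $a$ and variance 1. Given two sequences $(a_n)_{n\in\mathbb{N}}$ and $(b_n)_{n\in\mathbb{N}}$, define the following product measures on $\mathbb{R}^{\mathbb{N}}$:

$$\mu^{\otimes} := \bigotimes_{n\in\mathbb{N}} \lambda_{a_n}, \qquad \nu^{\otimes} := \bigotimes_{n\in\mathbb{N}} \lambda_{b_n}.$$

Then, $\mu^{\otimes} \sim \nu^{\otimes}$ if and only if $\sum_{n\in\mathbb{N}} (a_n - b_n)^2 < \infty$.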

This example tells us that, in order to be equivalent, infinite-dimensional probability measures can only have finitely many degrees of freedom that carry significant information. While such cases represent respectable infinite-dimensional models, for the sake of the results developed in this thesis we think of them as being effectively finite-dimensional. On the other hand, in this thesis we will be interested in models that are genuinely infinite-dimensional, in the sense that they are constituted by infinitely many independent degrees of freedom.
2.5 Monte Carlo


Given a measurable space $(E, \mathcal{E})$ and a (possibly random) probability measure $\mu$ on it (perhaps known only up to a normalization factor), in this thesis we will be interested in the problem of approximating integrals of the form $\mu f = \int \mu(dx)\, f(x)$, for a suitable class of $\mathbb{R}$-valued $\mathcal{E}$-measurable functions $f$. The Monte Carlo approach consists in approximating $\mu f$ with the sample mean of $f$ under $\mu$. If $\mu$ is not random, this means:

$$\mu f = \mathbf{E} f(X) \approx \frac{1}{N} \sum_{i=1}^N f(X(i)),$$

where $X \sim \mu$, and $X(1), \ldots, X(N)$ are i.i.d. random variables (samples) with distribution $\mu$, for a certain $N \ge 1$. On the other hand, if $\mu$ is random this means:

$$\mu f = \mu f(Y) = \mathbf{E}(f(X) \mid Y) \approx \frac{1}{N} \sum_{i=1}^N f(X(i)),$$

where $Y$ is the random variable through which the randomness in $\mu$ comes, as prescribed by Remark 2.5.

It is convenient to introduce the following sampling operator on probability measures ($\delta_x$ denotes the Dirac measure with mass located at $x \in E$).

Definition 2.16 (Sampling operator). Let $\mu$ be a (possibly random) probability measure on $(E, \mathcal{E})$. Define the sampling operator $S^N$ as

$$S^N \mu := \frac{1}{N} \sum_{i=1}^N \delta_{X(i)}, \qquad X(1), \ldots, X(N) \text{ are i.i.d. samples} \sim \mu.$$


As $S^N \mu$ is defined in terms of (possibly conditionally, cf. Remark 2.5) i.i.d. random variables, many results are available to assess the accuracy of $(S^N \mu) f$ as an estimate of $\mu f$. In particular, as $N$ goes to infinity the strong Law of Large Numbers tells us that $(S^N \mu) f$ converges almost surely to $\mu f$, while the Central Limit Theorem tells us that $\sqrt{N}\, \{(S^N \mu) f - \mu f\}$ converges in distribution to a Gaussian with mean 0 and variance $\mu f^2 - (\mu f)^2$. Non-asymptotic results can also be easily obtained, such as bounds on tail probabilities $\mathbf{P}\{|(S^N \mu) f - \mu f| \ge t\}$, for $t \ge 0$, and bounds on error moments $\mathbf{E}\, |(S^N \mu) f - \mu f|^p$, for $p \ge 1$. We presently prove a result for the case $p = 2$, as this will be used repeatedly in this thesis. We refer to [ ] for a systematic collection of these results.

Let us first recall the bias/variance decomposition of the mean square error, which is one of the most analytically tractable measures of the quality of an estimator:

$$\mathbf{E}\big((S^N \mu) f - \mu f\big)^2 = \underbrace{\mathbf{E}\big((S^N \mu) f - \mathbf{E}(S^N \mu) f\big)^2}_{\text{variance}} + \underbrace{\big[\mathbf{E}(S^N \mu) f - \mu f\big]^2}_{\text{bias}^2}.$$

Clearly, $(S^N \mu) f$ is an unbiased estimator for each $N \ge 1$, as $\mathbf{E}(S^N \mu) f = \mu f$. As for the variance of the estimator, we have the following lemma.
Lemma 2.17 (Monte Carlo variance). Let $\mu$ be a random probability measure on $(E, \mathcal{E})$, and let $Y$ be as in Remark 2.5. For each positive $\mathcal{E}$-measurable function $f$ we have

$$\mathbf{E}\big( ((S^N \mu) f - \mu f)^2 \,\big|\, Y \big) = \frac{1}{N} \{\mu(f^2) - (\mu f)^2\}.$$

As  a  consequence, 

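Proof. Note that

$$((S^N \mu) f - \mu f)^2 = (\mu f)^2 - \frac{2\, \mu f}{N} \sum_{i=1}^N f(X(i)) + \frac{1}{N^2} \sum_{i=1}^N (f(X(i)))^2 + \frac{1}{N^2} \sum_{i\ne j} f(X(i))\, f(X(j)).$$

By definition of the samples $X(1), \ldots, X(N)$ (see also Remark 2.5) and by the properties of conditional expectations (recall that $\mu f$ is $\sigma(Y)$-measurable), we have

$$\mathbf{E}\big( ((S^N \mu) f - \mu f)^2 \,\big|\, Y \big) = (\mu f)^2 - 2\, (\mu f)^2 + \frac{1}{N}\, \mu(f^2) + \frac{N-1}{N}\, (\mu f)^2 = \frac{1}{N} \{\mu(f^2) - (\mu f)^2\},$$

and the statement follows immediately. □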
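For a non-random measure on a small finite space, the identity of Lemma 2.17 can be verified exactly by enumerating all possible sample tuples rather than by simulation. The Python sketch below does this for a hypothetical three-point distribution and $N = 4$; the names and the numerical values are assumptions made for illustration.

```python
import itertools

def exact_mse(mu, f, N):
    # E((S^N mu) f - mu f)^2, computed by summing over all k^N sample tuples.
    k = len(mu)
    muf = sum(mu[x] * f[x] for x in range(k))
    total = 0.0
    for xs in itertools.product(range(k), repeat=N):
        prob = 1.0
        for x in xs:
            prob *= mu[x]                      # i.i.d. samples from mu
        estimate = sum(f[x] for x in xs) / N   # (S^N mu) f
        total += prob * (estimate - muf) ** 2
    return total

def variance_formula(mu, f, N):
    # (mu(f^2) - (mu f)^2) / N, the right-hand side of the lemma.
    muf = sum(m * v for m, v in zip(mu, f))
    muf2 = sum(m * v * v for m, v in zip(mu, f))
    return (muf2 - muf ** 2) / N

mu = [0.2, 0.5, 0.3]   # hypothetical distribution on {0, 1, 2}
f = [1.0, 4.0, 2.0]    # hypothetical positive test function
print(exact_mse(mu, f, 4), variance_formula(mu, f, 4))
```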


The Monte Carlo approximation scheme introduced above is practicable only when it is possible (and computationally convenient) to sample from the distribution $\mu$ itself, the so-called target distribution. More generally, there are situations where it is more convenient to sample from another distribution $\nu$ on $(E, \mathcal{E})$, which is then referred to as the importance distribution (or proposal distribution). The importance sampling paradigm is based on the idea that we can approximate $\mu f$ using samples coming from $\nu$. In fact, if $\mu \ll \nu$ then the Radon-Nikodym theorem (Theorem 2.6) yields

$$\mu f = \mathbf{E} f(X) = \mathbf{E}\, \frac{d\mu}{d\nu}(Z)\, f(Z) \approx \frac{1}{N} \sum_{i=1}^N \frac{d\mu}{d\nu}(Z(i))\, f(Z(i)),$$

where $X \sim \mu$, $Z \sim \nu$, and $Z(1), \ldots, Z(N)$ are i.i.d. samples with distribution $\nu$.

More generally, in this thesis we will deal with situations where the target distribution $\mu$, or the importance distribution $\nu$, or both, are only known up to a scalar factor. In this case the Radon-Nikodym derivative $\frac{d\mu}{d\nu}$ is also known only up to a constant factor. Nonetheless, the importance sampling paradigm can still be implemented by considering the following approximation where constant factors cancel out:

$$\mu f = \mathbf{E} f(X) = \mathbf{E}\, \frac{d\mu}{d\nu}(Z)\, f(Z) = \frac{\mathbf{E}\, \frac{d\mu}{d\nu}(Z)\, f(Z)}{\mathbf{E}\, \frac{d\mu}{d\nu}(Z)} \approx \frac{\sum_{i=1}^N \frac{d\mu}{d\nu}(Z(i))\, f(Z(i))}{\sum_{i=1}^N \frac{d\mu}{d\nu}(Z(i))},$$

where $X \sim \mu$, $Z \sim \nu$, $Z(1), \ldots, Z(N)$ are i.i.d. samples with distribution $\nu$, and we have used that $\mathbf{E}\, \frac{d\mu}{d\nu}(Z) = \mu(E) = 1$. The self-normalized importance sampling paradigm will be used in Chapter 3 to describe the basic algorithms upon which much of the work in this thesis is based. For this reason, we introduce a sampling operator also for this case.


Definition 2.18 (Self-normalized importance sampling operator). Let $\mu$, $\nu$ be (possibly random) probability measures on $(E, \mathcal{E})$ such that $\mu \ll \nu$. Define the self-normalized importance sampling operator $S^N_\nu$ as

$$S^N_\nu \mu := \sum_{i=1}^N W(i)\, \delta_{Z(i)}, \qquad Z(1), \ldots, Z(N) \text{ are i.i.d. samples} \sim \nu,$$

where the weights $W(1), \ldots, W(N)$ are defined as

$$W(i) := \frac{\frac{d\mu}{d\nu}(Z(i))}{\sum_{\ell=1}^N \frac{d\mu}{d\nu}(Z(\ell))}.$$

Clearly, we have $S^N_\mu \mu = S^N \mu$. Consistency and asymptotic normality are easy to prove, and now $\sqrt{N}\, \{(S^N_\nu \mu) f - \mu f\}$ converges in distribution to a Gaussian with mean 0 and variance $\mathbf{E}\big(\frac{d\mu}{d\nu}(Z)\, (f(Z) - \mu f)\big)^2$, $Z \sim \nu$. The self-normalized estimate $(S^N_\nu \mu) f$ is biased for any fixed value of $N$, and establishing non-asymptotic results is not as straightforward as for the ordinary Monte Carlo approximation. We refer again to [ ] for details.
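A minimal Python sketch of the self-normalized estimator may help fix ideas. It assumes a hypothetical setting not taken from the text: the target $\mu$ is a standard Gaussian known only through the unnormalized density $e^{-x^2/2}$, the importance distribution $\nu$ is a centered Gaussian with standard deviation 2, and $f(x) = x^2$, so that $\mu f = 1$. Only the ratio of unnormalized densities enters the weights, and the weights are computed in log space for numerical stability (an implementation choice, not part of Definition 2.18).

```python
import math
import random

def snis(f, log_ratio, sample, N, rng):
    # Self-normalized importance sampling estimate of mu f.
    # log_ratio(z): log of d(mu)/d(nu) at z, known up to an additive constant.
    # sample(rng):  draws one sample from the importance distribution nu.
    zs = [sample(rng) for _ in range(N)]
    logw = [log_ratio(z) for z in zs]
    shift = max(logw)                                # guards against underflow
    unnorm = [math.exp(lw - shift) for lw in logw]
    total = sum(unnorm)
    weights = [u / total for u in unnorm]            # W(1), ..., W(N), summing to one
    estimate = sum(w * f(z) for w, z in zip(weights, zs))
    return estimate, weights

rng = random.Random(0)
estimate, weights = snis(
    f=lambda x: x * x,
    log_ratio=lambda x: -0.5 * x * x + x * x / 8.0,  # log(e^{-x^2/2} / e^{-x^2/8})
    sample=lambda r: r.gauss(0.0, 2.0),
    N=20000,
    rng=rng,
)
print(estimate)
```

The unknown normalizing constants of both densities cancel in the ratio defining the weights, which is the point of self-normalization.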
Chapter 3

Classical  nonlinear  filtering  and 
particle  filters 


This chapter provides an overview of the classical theory of nonlinear filtering and of the sequential Monte Carlo algorithms known as particle filters. Emphasis is given to the stability property of the filter distribution, which is the key to establishing time-uniform error bounds for particle filters. The treatment revolves around the curse of dimensionality phenomenon, and the coverage is instrumental to the content that will be developed in Chapter 4 and Chapter 5. The presentation is inspired by [55] and [8].


3.1  Hidden  Markov  models  and  nonlinear  filter 

Let $(X, \mathcal{X})$ and $(Y, \mathcal{Y})$ be two Polish spaces. We define a hidden Markov model as a $(X \times Y, \mathcal{X} \otimes \mathcal{Y})$-measurable Markov chain $(X_n, Y_n)_{n\ge 0}$ whose transition probability kernel $K$ can be factored as

$$K f(x, y) = \int p(x, x')\, g(x', y')\, \psi(dx')\, \varphi(dy')\, f(x', y'),$$

for each $x \in X$, $y \in Y$ and each $\mathcal{X} \otimes \mathcal{Y}$-measurable function $f$. Thus, $(X_n)_{n\ge 0}$ is itself a Markov chain in $(X, \mathcal{X})$ with transition density $p : X \times X \to \mathbb{R}_+$ with respect to a given reference measure $\psi$, while $(Y_n)_{n\ge 0}$ are random variables in $(Y, \mathcal{Y})$ that are conditionally independent given $(X_n)_{n\ge 0}$ with transition density $g : X \times Y \to \mathbb{R}_+$ with respect to a reference measure $\varphi$. This dependency structure is illustrated in Figure 3.1. We interpret $(X_n)_{n\ge 0}$ as an underlying dynamical process (the signal) that is not directly observable, while the observable process $(Y_n)_{n\ge 0}$ consists of partial and noisy observations of $(X_n)_{n\ge 0}$. The hidden Markov model setting is convenient mathematically and is ubiquitous in practice as a model of noisy observations of random dynamics.

Figure 3.1: Dependency graph of a hidden Markov model.

In the following we will assume that the process $(X_n, Y_n)_{n\ge 0}$ is realized on its canonical probability space, and for any probability measure $\mu$ on $(X, \mathcal{X})$ we denote by $\mathbf{P}^\mu$ the probability measure under which $(X_n, Y_n)_{n\ge 0}$ is a hidden Markov model with transition probability kernel $K$ as above and with initial condition $X_0 \sim \mu$ (if we simply write $\mathbf{P}$, then it means that any choice of the initial measure would yield equivalent results for the argument being considered). For $x \in X$, we write for simplicity $\mathbf{P}^x := \mathbf{P}^{\delta_x}$. As the process $(X_n)_{n\ge 0}$ is unobservable, a central problem in this setting is to track the unobserved state $X_n$ given the observation history $Y_1, \ldots, Y_n$: that is, we aim to compute the nonlinear filter

$$\pi^\mu_n := \mathbf{P}^\mu(X_n \in \cdot \mid Y_1, \ldots, Y_n).$$

Filtering — the  computation  of  the  conditional  distributions  of  a  hidden  Markov 
process  given  observed  data — is  a  problem  that  arises  in  a  wide  array  of  applications 
in  science  and  engineering,  classically  in  the  field  of  tracking,  speech  recognition,  and 
finance.  We  refer  to  [8]  for  a  rich  list  of  applications. 

Remark 3.1 (A matter of notation). To be precise, given our definition of conditional distributions (Definition 2.f), we should write $\pi^\mu_n(Y_{1:n}, \cdot\,)$ instead of $\pi^\mu_n$. However, in what follows we only use the kernel notation $\pi^\mu_n(y_{1:n}, dx)$ to emphasize the dependence of the filter on a particular sequence of observations $Y_{1:n} = y_{1:n}$. Hence, we interpret $\pi^\mu_n$ as a random measure whose randomness is (implicitly) provided by the observations $Y_1, \ldots, Y_n$.

Being  a  conditional  distribution,  the  filter  yields  least  mean  square  estimates,  and 
for  this  reason  it  is  often  referred  to  as  the  optimal  filter. 

Lemma 3.2 (Optimality of the filter). Fix $n \ge 0$. Let $f$ be a measurable function such that $\mathbf{E}^\mu f(X_n)^2 < \infty$. Then,

$$\pi^\mu_n f = \arg\min_h\, \mathbf{E}^\mu \big(f(X_n) - h(Y_{1:n})\big)^2,$$

where the minimization is over measurable functions $h$.

Proof. It follows immediately from Lemma 2.2, choosing $X = f(X_n)$ and $Y = (Y_1, \ldots, Y_n)$. □

If the conditional distribution $\pi^\mu_n$ can be computed, it yields not only a least mean square estimate of the unobserved state $X_n$, but also a complete representation of the uncertainty in this estimate.

An important property of the filter is that it can be computed recursively, which follows immediately from Bayes formula (Lemma 2.8).
Lemma 3.3 (Filter recursion). The filter distribution $\pi^\mu_n$ can be computed recursively according to

$$\pi^\mu_n f = \frac{\int \pi^\mu_{n-1}(dx)\, p(x, x')\, \psi(dx')\, g(x', Y_n)\, f(x')}{\int \pi^\mu_{n-1}(dx)\, p(x, x')\, \psi(dx')\, g(x', Y_n)}$$

with the initial condition $\pi^\mu_0 = \mu$.

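Proof. Fix $n > 0$. By construction $X_{0:n}$ has distribution $\rho$ given by

$$\rho(dx_{0:n}) := \mathbf{P}^\mu(X_{0:n} \in dx_{0:n}) = \mu(dx_0)\, p(x_0, x_1)\, \psi(dx_1) \cdots p(x_{n-1}, x_n)\, \psi(dx_n).$$

Define the probability measure $\lambda$ on $(Y^n, \mathcal{Y}^n)$ as

$$\lambda(dy_{1:n}) := \varphi(dy_1) \cdots \varphi(dy_n),$$

and define the positive function $\gamma$ as

$$\gamma(x_{0:n}, y_{1:n}) := g(x_1, y_1) \cdots g(x_n, y_n).$$

By construction we have

$$\mathbf{E}^\mu f(X_{0:n}, Y_{1:n}) = \int \rho(dx_{0:n})\, \lambda(dy_{1:n})\, \gamma(x_{0:n}, y_{1:n})\, f(x_{0:n}, y_{1:n})$$

for each positive measurable function $f$. By Bayes formula (Lemma 2.8) we have that the conditional distribution of $X_{0:n}$ given $Y_{1:n}$ is given by the probability kernel $P$ defined as

$$P f(Y_{1:n}) = \int \mathbf{P}^\mu(X_{0:n} \in dx_{0:n} \mid Y_{1:n})\, f(x_{0:n}) = \frac{\int \rho(dx_{0:n})\, \gamma(x_{0:n}, Y_{1:n})\, f(x_{0:n})}{\int \rho(dx_{0:n})\, \gamma(x_{0:n}, Y_{1:n})}.$$

It is immediately verified that

$$\pi^\mu_n f = \int P(Y_{1:n}, dx_{0:n})\, f(x_n) = \frac{\int \rho(dx_{0:n})\, \gamma(x_{0:n}, Y_{1:n})\, f(x_n)}{\int \rho(dx_{0:n})\, \gamma(x_{0:n}, Y_{1:n})} = \frac{\int \pi^\mu_{n-1}(dx)\, p(x, x')\, \psi(dx')\, g(x', Y_n)\, f(x')}{\int \pi^\mu_{n-1}(dx)\, p(x, x')\, \psi(dx')\, g(x', Y_n)}.$$

□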
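On a finite state space the recursion of Lemma 3.3 is a few lines of code, and it can be compared against the brute-force formula from the proof, which sums $\rho\,\gamma$ over all paths $x_{0:n}$. The Python sketch below assumes a hypothetical two-state signal with two observation symbols, with counting measures as reference measures; all names and numerical values are illustrative.

```python
import itertools

def filter_recursion(mu, p, g, ys):
    # pi_0 = mu;  pi_n(x') is proportional to sum_x pi_{n-1}(x) p[x][x'] g[x'][y_n].
    k = len(mu)
    pi = list(mu)
    for y in ys:
        unnorm = [sum(pi[x] * p[x][x2] for x in range(k)) * g[x2][y] for x2 in range(k)]
        z = sum(unnorm)
        pi = [u / z for u in unnorm]
    return pi

def filter_bruteforce(mu, p, g, ys):
    # Time-n marginal of rho(x_{0:n}) gamma(x_{0:n}, y_{1:n}), normalized.
    k, n = len(mu), len(ys)
    unnorm = [0.0] * k
    for path in itertools.product(range(k), repeat=n + 1):
        w = mu[path[0]]
        for t in range(1, n + 1):
            w *= p[path[t - 1]][path[t]] * g[path[t]][ys[t - 1]]
        unnorm[path[-1]] += w
    z = sum(unnorm)
    return [u / z for u in unnorm]

mu = [0.6, 0.4]                    # hypothetical initial distribution
p = [[0.9, 0.1], [0.2, 0.8]]       # hypothetical transition matrix p[x][x']
g = [[0.7, 0.3], [0.1, 0.9]]       # hypothetical observation matrix g[x][y]
ys = [0, 1, 1, 0, 1]               # hypothetical observation sequence y_1, ..., y_5

print(filter_recursion(mu, p, g, ys), filter_bruteforce(mu, p, g, ys))
```

The recursion costs a number of operations linear in $n$, whereas the brute-force sum grows exponentially in $n$; this is the practical content of the recursive structure.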


The recursive structure of the nonlinear filter is of central importance, as it allows the filter to be computed on-line over a long time horizon. Nonetheless, the recursion is still at the level of probability measures, and in general no finite-dimensional sufficient statistics exist. Important exceptions are two special cases: linear Gaussian models (which give rise to the celebrated Kalman filter) and models with a (small) finite state space, cf. [8]. However, most complex models do not fall into these very limited categories. Therefore, the practical implementation of nonlinear filters typically proceeds by sequential Monte Carlo approximations known as particle filters. We refer to [19] for a survey on these methods. In the present context we limit ourselves to describing, in their basic formulations, the two main algorithms that have been considered in the filtering literature. We present these algorithms in the light of the curse of dimensionality phenomenon that affects both of them, which will be instrumental for the material developed in Chapter 4 and Chapter 5.
3.2 Sequential importance sampling


One of the first Monte Carlo algorithms used to approximate the filter distribution is the sequential importance sampling (SIS) particle filter. The introduction of this algorithm can be traced back to the pioneering work of Handschin and Mayne in 1969 [30]. The idea behind the SIS algorithm is to apply the self-normalized importance sampling paradigm introduced in Section 2.5 to approximate the so-called smoothing distribution $\mathbf{P}^\mu(X_{0:n} \in \cdot \mid Y_{1:n})$, and then compute the marginal at time $n$ to approximate the filter $\pi^\mu_n = \mathbf{P}^\mu(X_n \in \cdot \mid Y_{1:n})$.

To see how the SIS works, fix $n \ge 1$ and assume that we are given the observations $Y_1, \ldots, Y_n$. Our goal is to approximate integrals with respect to the (random) measure $\pi^\mu_n$. From the proof of Lemma 3.3 we know that the conditional distribution of $X_{0:n}$ given $Y_{1:n}$ is given by the kernel $P$ defined as

$$P(Y_{1:n}, dx_{0:n}) = \mathbf{P}^\mu(X_{0:n} \in dx_{0:n} \mid Y_{1:n}) = \frac{1}{Z}\, \rho(dx_{0:n})\, g(x_1, Y_1) \cdots g(x_n, Y_n),$$

where

$$\rho(dx_{0:n}) := \mathbf{P}^\mu(X_{0:n} \in dx_{0:n}) = \mu(dx_0)\, p(x_0, x_1)\, \psi(dx_1) \cdots p(x_{n-1}, x_n)\, \psi(dx_n)$$

and

$$Z := \int \rho(dx_{0:n})\, g(x_1, Y_1) \cdots g(x_n, Y_n). \qquad (3.1)$$

At first sight, we might think of using straightforwardly the Monte Carlo approximation (recall the definition of the sampling operator $S^N$, Definition 2.16)

$$\pi^\mu_n f \approx \int (S^N P_{Y_{1:n}})(dx_{0:n})\, f(x_n) = \frac{1}{N} \sum_{i=1}^N f(X_n(i)),$$

where, for each $i \in \{1, \ldots, N\}$, $X(i) := (X_0(i), \ldots, X_n(i))$ is an independent sample from the distribution $P_{Y_{1:n}} := P(Y_{1:n}, \cdot\,)$ (conditionally independent given $Y_1, \ldots, Y_n$, see Remark 2.5). Of course, the problem with this approach is that in general we do not know how to sample from $P_{Y_{1:n}}$. However, by construction it is usually easy to sample from the signal Markov chain $(X_n)_{n\ge 0}$. This is the case, for instance, if the signal is modeled as a recursion

$$X_n = h(X_{n-1}, \xi_n), \qquad n \ge 1,$$

where $(\xi_n)_{n\ge 1}$ are i.i.d. random variables having a distribution that can be efficiently sampled (for instance, the uniform distribution or the Gaussian distribution), and $h$ is a non-random function that we know pointwise. In this case, in fact, we can sample $X_n \sim p(x_{n-1}, \cdot\,)\, \psi$ by sampling $\xi_n$ first, and then computing $X_n = h(x_{n-1}, \xi_n)$.

This fact suggests using importance sampling with $\rho$ as importance distribution and $P_{Y_{1:n}}$ as target distribution. The Radon-Nikodym derivative reads

$$\frac{dP_{Y_{1:n}}}{d\rho}(x_{0:n}) = \frac{1}{Z}\, g(x_1, Y_1) \cdots g(x_n, Y_n).$$
Since the normalization constant $Z$ is not easy to compute (else, again, computing the filter distribution would not be a problem in the first place), we apply the self-normalized importance sampling operator (Definition 2.18) to get

$$\pi^\mu_n f \approx \int (S^N_\rho P_{Y_{1:n}})(dx_{0:n})\, f(x_n) = \sum_{i=1}^N W_n(i)\, f(X_n(i)),$$

where for each $i \in \{1, \ldots, N\}$ we have

$$W_n(i) := \frac{g(X_1(i), Y_1) \cdots g(X_n(i), Y_n)}{\sum_{\ell=1}^N g(X_1(\ell), Y_1) \cdots g(X_n(\ell), Y_n)}$$

and $X(i) := (X_0(i), \ldots, X_n(i))$ is an independent sample from the distribution $\rho$. Note that the weights $W_n(1), \ldots, W_n(N)$ are positive and they sum to 1, and they depend on the (random) observation sequence $Y_1, \ldots, Y_n$. So, the SIS particle filter approximation at time $n$ is given by

$$\hat\pi^\mu_n(dx_n) := \int_{x_{0:n-1}} (S^N_\rho P_{Y_{1:n}})(dx_{0:n}) = \sum_{i=1}^N W_n(i)\, \delta_{X_n(i)}(dx_n).$$


A key observation is that the weights can be computed recursively, namely,

$$W_n(i) \propto W_{n-1}(i)\, g(X_n(i), Y_n), \qquad W_0(i) = 1/N, \qquad (3.2)$$

where the proportionality is up to the normalization factor so that $\sum_{i=1}^N W_n(i) = 1$. This fact suggests that the SIS particle filter can be implemented in an on-line fashion, as described in Figure 3.2. Figure 3.3 illustrates a typical iteration of the algorithm.


Algorithm 1: SIS particle filter
Data: Fix $n, N \ge 1$. Let the observations $Y_1,\dots,Y_n$ be given.

Sample $X_0(i)$, $i = 1,\dots,N$, from the initial distribution $\mu$;

Set $W_0(i) = 1/N$, $i = 1,\dots,N$;

for $k = 1,\dots,n$ do

    Sample independently $X_k(i) \sim p(X_{k-1}(i),\cdot)\,d\psi$, $i = 1,\dots,N$;

    Compute $W_k(i) = W_{k-1}(i)\, g(X_k(i),Y_k) \,/\, \sum_{\ell=1}^N W_{k-1}(\ell)\, g(X_k(\ell),Y_k)$, $i = 1,\dots,N$;

    Let $\hat\pi_k^\mu = \sum_{i=1}^N W_k(i)\,\delta_{X_k(i)}$;

Compute the approximate filter $\hat\pi_n^\mu \approx \pi_n^\mu$.

Figure  3.2:  The  classical  sequential  importance  sampling  (SIS)  particle  filter. 
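
The steps of Algorithm 1 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the names `sis_filter` and `estimate`, and the callables `sample_initial`, `sample_transition` and `likelihood`, are hypothetical and introduced here; they stand in for sampling from $\mu$, sampling from $p(x,\cdot)\,d\psi$, and evaluating $g$, respectively.

```python
def sis_filter(observations, n_particles, sample_initial, sample_transition, likelihood):
    """Sketch of the SIS particle filter (Algorithm 1).

    Hypothetical interface (not part of the text):
      sample_initial()      -> one draw from the initial distribution mu
      sample_transition(x)  -> one draw from p(x, .) d(psi)
      likelihood(x, y)      -> observation density g(x, y)
    Returns the particles X_n(i) and the normalized weights W_n(i).
    """
    particles = [sample_initial() for _ in range(n_particles)]
    weights = [1.0 / n_particles] * n_particles            # W_0(i) = 1/N
    for y in observations:                                 # k = 1, ..., n
        # propagate each particle with the signal dynamics
        particles = [sample_transition(x) for x in particles]
        # multiplicative weight recursion (3.2), then normalization
        unnormalized = [w * likelihood(x, y) for w, x in zip(weights, particles)]
        total = sum(unnormalized)
        weights = [u / total for u in unnormalized]
    return particles, weights


def estimate(particles, weights, f):
    """Weighted particle average, approximating the filter applied to f."""
    return sum(w * f(x) for w, x in zip(weights, particles))
```

Note that the sketch multiplies raw likelihoods, so for long observation sequences the weights may underflow; a practical implementation would typically work with log-weights.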

For any fixed time $n \ge 1$, the quality of the estimates obtained by the SIS particle
filter as a function of the number of particles $N$ can be easily assessed by the general
theory on self-normalized importance sampling, see Section 2.5. In particular, the
SIS particle filter does indeed approximate the exact nonlinear filter as $N$ goes to
infinity with the typical Monte Carlo $1/\sqrt{N}$-rate for the mean-square error, namely,

$$\sup_{|f|\le 1} \sqrt{\mathbf{E}\,(\pi_n^\mu f - \hat\pi_n^\mu f)^2} \le \frac{C_n}{\sqrt{N}},$$

where $C_n$ is a constant that depends on time $n$.

Figure 3.3: Representation of a single iteration of the SIS particle filter in the
case when the state and observation spaces are $\mathbb{X} = \mathbb{Y} = \mathbb{R}_+$, and when there are
$N = 6$ particles considered by the algorithm. Each particle is represented by a blue
ball, whose size is proportional to the weight of the particle. (a) Representation
of $\hat\pi_{n-1}^\mu$. (b) Particles are propagated forward using the underlying dynamics. (c)
Particles are reweighted according to the likelihood of the new observation at time
$n$ (whose level sets are drawn in orange) yielding $\hat\pi_n^\mu$, following the multiplicative
weight recursion (3.2).


3.2.1  Sample  degeneracy  with  time 

The SIS algorithm is a sequential implementation of the general importance sampling
paradigm ("sequential" in the sense that there is no need to regenerate the populations
of samples from scratch at the arrival of new observations). It turns out that
importance sampling is usually very inefficient in high-dimensional models, so that
the SIS particle filter performs poorly as time increases. The issue comes from the fact
that importance sampling employs a finite number of samples from $\mathbf{P}^\mu(X_{0:n} \in \cdot\,)$ to
approximate the target distribution $\mathbf{P}^\mu(X_{0:n} \in \cdot \mid Y_{1:n})$, and the approximation does
not work well if the two distributions are too far apart, which is what happens if time
$n$ is large. In practice the SIS algorithm fails because the distribution of the weights
$W_n(1),\dots,W_n(N)$ degenerates as time $n$ increases, and essentially only one particle
is left with a non-negligible weight after a few time steps (recall that at each time step
the weights sum to 1 by construction). This phenomenon is known as collapse or
sample/weight degeneracy. The following example clarifies this issue.

Example 3.4 (Weight degeneracy of SIS with time). In the framework introduced in
Section 3.1, consider the hidden Markov model where $(X_n)_{n\ge 0}$ is a symmetric random
walk in $\mathbb{Z}^2$ with $X_0 = x \in \mathbb{Z}^2$, and for each $n \ge 1$ we have $Y_n = X_n + \varepsilon\eta_n$, where $\varepsilon \in \mathbb{R}_+$
and $(\eta_n)_{n\ge 0}$ is a collection of i.i.d. random variables having the standard Gaussian
distribution in $\mathbb{R}^2$ (zero mean and identity covariance matrix). If the signal-to-noise
ratio is high, that is, if $\varepsilon$ is very close to 0, then we expect the smoothing distribution
$\mathbf{P}^x(X_{0:n} \in \cdot \mid Y_{1:n})$ to be very concentrated around $X_{0:n}$, the true location of the path
of the signal up to time $n$. However, if we sample $N$ particles from the distribution
$\mathbf{P}^x(X_{0:n} \in \cdot\,)$, where each particle represents a path of $n$ steps of the symmetric
random walk, then only a small fraction of the particles will be close to any given trajectory
in $\mathbb{Z}^2$, and the problem clearly gets worse as time increases.
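
The collapse described in Example 3.4 can be observed numerically. The following sketch is a simplification made here for brevity, not the model of the example itself: it uses a symmetric random walk on $\mathbb{Z}$ rather than $\mathbb{Z}^2$, and the function name `max_weight_history` and its parameters are hypothetical. It records the largest normalized SIS weight after each time step, with the recursion (3.2) carried out on a logarithmic scale to avoid numerical underflow.

```python
import math
import random


def max_weight_history(n_steps, n_particles, eps, seed=0):
    """Largest normalized SIS weight after each time step.

    Toy setting (an assumption for illustration): symmetric random walk on Z
    started at 0, observed as Y_k = X_k + eps * eta_k with eta_k standard
    Gaussian and eps > 0.
    """
    rng = random.Random(seed)
    truth = 0                                  # hidden signal, X_0 = 0
    particles = [0] * n_particles              # every particle starts at X_0
    log_w = [0.0] * n_particles                # unnormalized log-weights
    history = []
    for _ in range(n_steps):
        truth += rng.choice((-1, 1))
        y = truth + eps * rng.gauss(0.0, 1.0)
        particles = [x + rng.choice((-1, 1)) for x in particles]
        # recursion (3.2) in log scale: add the Gaussian log-likelihood
        log_w = [lw - (y - x) ** 2 / (2.0 * eps ** 2)
                 for lw, x in zip(log_w, particles)]
        top = max(log_w)
        w = [math.exp(lw - top) for lw in log_w]   # rescaled so that max(w) == 1
        history.append(1.0 / sum(w))               # largest normalized weight
    return history
```

Calling, for instance, `max_weight_history(50, 100, 0.5)` returns one value per time step, each between $1/N$ and 1; in runs of this kind one typically sees the largest weight drift towards 1, which is the degeneracy discussed above.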

The  phenomenon  of  weight  degeneracy  of  the  SIS  algorithm  with  time  has  been 
analyzed  in  various  settings.  The  following  example  (adapted  from  Example  7.3.1  in 
]) analyzes the poor performance of the SIS algorithm asymptotically (in the limit
$N \to \infty$) as time increases.

Example 3.5 (Exponential growth of the SIS asymptotic variance with time). In
the general framework introduced in Section 3.1, consider the hidden Markov model
where $(X_n)_{n\ge 0}$ is a collection of i.i.d. random variables with distribution $\mu$ (that is,
$p(x,\cdot)\,\psi = \mu$ for each $x \in \mathbb{X}$). Then, for each time $n \ge 1$ we have

$$N^{1/2}(\hat\pi_n^\mu f - \pi_n^\mu f) \longrightarrow \mathrm{Gaussian}(0, \sigma_n^2(f)) \quad \text{in distribution as } N \to \infty,$$

with

$$\sigma_n^2(f) := c(f)\,\gamma^{n-1},$$

where $c(f)$ and $\gamma$ are constants that do not depend on $n$, $c(f) > 0$ as long as $f$ is not
a constant, and $\gamma > 1$ as long as the observation density $g$ is different from 1.

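First of all, as $(X_n)_{n\ge 0}$ is a collection of i.i.d. random variables with distribution
$\mu$, it follows that also $(Y_n)_{n\ge 1}$ is a collection of i.i.d. random variables with distribution

$$\mathbf{P}(Y_1 \in A) = \int \mu(dx)\, g(x,y)\, \varphi(dy)\, 1_A(y).$$

For each $x \in \mathbb{X}$, $y \in \mathbb{Y}$ define

$$\hat g(x,y) := \frac{g(x,y)}{\int \mu(dx')\, g(x',y)}.$$

Then, for each $N \ge 1$, $n \ge 1$ we have

$$N^{1/2}(\hat\pi_n^\mu f - \pi_n^\mu f) = \frac{N^{-1/2}\sum_{i=1}^N \big(f(X_n(i)) - \pi_n^\mu f\big)\prod_{k=1}^n \hat g(X_k(i),Y_k)}{N^{-1}\sum_{i=1}^N \prod_{k=1}^n \hat g(X_k(i),Y_k)}, \tag{3.3}$$
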


where $(X_k(i))$, $i \in \{1,\dots,N\}$, $k \in \{1,\dots,n\}$, is a collection of i.i.d. random variables
with distribution $\mu$, conditionally independent given $Y_1,\dots,Y_n$. By independence, for
each $i$ and $k$ we have

$$\mathbf{E}\,\hat g(X_k(i),Y_k) = \mathbf{E}\,\mathbf{E}\big(\hat g(X_k(i),Y_k) \mid Y_k\big) = \mathbf{E}\int \mu(dx)\,\hat g(x,Y_k) = 1,$$

and the strong Law of Large Numbers yields, as $N \to \infty$,

$$N^{-1}\sum_{i=1}^N \prod_{k=1}^n \hat g(X_k(i),Y_k) \longrightarrow \mathbf{E}\prod_{k=1}^n \hat g(X_k(1),Y_k) = \prod_{k=1}^n \mathbf{E}\,\hat g(X_k(1),Y_k) = 1 \quad \text{almost surely}$$

for the denominator in (3.3). On the other hand, as

$$\pi_n^\mu f = \int \mu(dx)\,\hat g(x,Y_n)\, f(x),$$

by independence it is immediately verified that

$$\mathbf{E}\Big[\big(f(X_n(i)) - \pi_n^\mu f\big)\prod_{k=1}^n \hat g(X_k(i),Y_k)\Big] = 0$$



and

$$\sigma_n^2(f) := \mathbf{E}\Big[\big(f(X_n(1)) - \pi_n^\mu f\big)^2 \prod_{k=1}^n \hat g(X_k(1),Y_k)^2\Big] = c(f)\,\gamma^{n-1},$$

where

$$c(f) := \mathbf{E}\int \mu(dx)\,\big(f(x) - \pi_1^\mu f\big)^2\, \hat g(x,Y_1)^2, \qquad \gamma := \mathbf{E}\int \mu(dx)\,\hat g(x,Y_1)^2.$$

The Central Limit Theorem yields that the numerator in (3.3) converges in distribution
as $N \to \infty$ to a Gaussian distribution with mean 0 and variance $\sigma_n^2(f)$. Therefore,
it is immediate that also (3.3) converges in distribution to the same Gaussian
distribution. Applying Jensen's inequality twice we get

$$1 = \Big(\mathbf{E}\int \mu(dx)\,\hat g(x,Y_1)\Big)^2 \le \mathbf{E}\Big(\int \mu(dx)\,\hat g(x,Y_1)\Big)^2 \le \mathbf{E}\int \mu(dx)\,\hat g(x,Y_1)^2 = \gamma.$$

Thus, the asymptotic variance of the SIS algorithm increases exponentially with time
as long as $g$ is different from 1.
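
To get a feeling for the size of $\gamma$ in Example 3.5, it can be evaluated exactly in a small discrete instance. The instance below is an assumption chosen here purely for illustration (it does not appear in the text): the signal is uniform on $\{0,1\}$ and is observed through a binary symmetric channel with flip probability $q$, so that $g(x,y) = 1-q$ if $y = x$ and $g(x,y) = q$ otherwise, with counting reference measures. The hypothetical function `gamma_binary_channel` computes $\gamma$ by direct enumeration from its definition; in this instance one finds $\gamma = 2\,((1-q)^2 + q^2)$.

```python
def gamma_binary_channel(q):
    """gamma = E int mu(dx) ghat(x, Y_1)^2 for the toy model:
    X uniform on {0, 1}, g(x, y) = 1 - q if y == x else q."""
    mu = {0: 0.5, 1: 0.5}

    def g(x, y):
        return (1.0 - q) if x == y else q

    gamma = 0.0
    for y in (0, 1):
        p_y = sum(mu[x] * g(x, y) for x in mu)          # P(Y_1 = y)
        g_hat = {x: g(x, y) / p_y for x in mu}          # normalized density
        gamma += p_y * sum(mu[x] * g_hat[x] ** 2 for x in mu)
    return gamma


def variance_growth_factor(q, n):
    """Factor gamma^(n-1) multiplying c(f) in the asymptotic variance."""
    return gamma_binary_channel(q) ** (n - 1)
```

For uninformative observations ($q = 1/2$) the function returns $\gamma = 1$ and the variance does not grow, while for $q = 0.1$ it returns $\gamma = 1.64$, so that the factor $\gamma^{n-1}$ already exceeds 100 at $n = 11$.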

The analysis in Example 3.5 can be extended to more general models. However,
even for linear Gaussian models where computations can be carried out explicitly,
the analysis becomes much more involved (we refer to [8] and references therein). In
practice, weight degeneracy is a major limitation that has rendered the SIS particle filter
largely useless in many applications where one is interested in tracking the underlying
state reliably for more than a few time steps.

In the next section we show that a modification of the sampling scheme considered
so far can produce samples whose distribution is closer to the filter $\pi_n^\mu$. This yields
a new algorithm that can overcome the degeneracy of the weights with time.

Remark 3.6 (Importance sampling). In the literature (see [19] for instance) the
term "sequential importance sampling" is generally used to indicate a more general
algorithm than the one we just described. This term is used in the case where the
importance distribution being used in the importance sampling paradigm corresponds
to the law $\rho'$ of a given (possibly time-inhomogeneous) Markov chain $(Z_n)_{n\ge 0}$, which
can differ from the law $\rho$ of the signal Markov chain $(X_n)_{n\ge 0}$. The idea is to choose an
importance distribution that is as close as possible to the target distribution
$\mathbf{P}^\mu(X_{0:n} \in \cdot \mid Y_{1:n})$, so as to improve the performance of the algorithm and possibly
alleviate weight degeneracy with time. Presently, we limit ourselves to describing this
more general version of the SIS algorithm, and we refer to the discussion developed in
Section 3.3.3 to understand why importance sampling cannot tackle the curse of
dimensionality at a fundamental level.

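To make the point, fix $n \ge 1$, define

$$\rho'(dx_{0:n}) := \mu(dx_0)\, q_1(x_0,x_1)\,\psi(dx_1) \cdots q_n(x_{n-1},x_n)\,\psi(dx_n),$$

and assume that for each $k \in \{1,\dots,n\}$

$$(x,A) \in \mathbb{X}\times\mathcal{X} \longmapsto \int q_k(x,x')\,\psi(dx')\, 1_A(x')$$

is a given transition kernel so that $p(x,\cdot)\,\psi \ll q_k(x,\cdot)\,\psi$ for each $x \in \mathbb{X}$. Then, the
Radon-Nikodym derivative reads

$$\frac{d\mathbf{P}_{Y_{1:n}}}{d\rho'}(x_{0:n}) = \frac{1}{Z}\, \frac{p(x_0,x_1)\, g(x_1,Y_1)}{q_1(x_0,x_1)} \cdots \frac{p(x_{n-1},x_n)\, g(x_n,Y_n)}{q_n(x_{n-1},x_n)},$$
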

where $Z$ is defined in (3.1). In this case the self-normalized importance sampling
paradigm yields

$$(\pi_n^\mu f)(Y_{1:n}) \approx \int (\mathsf{S}^N_{\rho'} \mathbf{P}_{Y_{1:n}})(dx_{0:n})\, f(x_n) = \sum_{i=1}^N W_n(i)\, f(Z_n(i)),$$

where for each $i \in \{1,\dots,N\}$ the weight recursion now reads

$$W_n(i) \propto W_{n-1}(i)\, \frac{p(Z_{n-1}(i),Z_n(i))\, g(Z_n(i),Y_n)}{q_n(Z_{n-1}(i),Z_n(i))}, \qquad W_0(i) = 1/N,$$

(the proportionality is always up to the normalization factor so that $\sum_{i=1}^N W_n(i) = 1$),

and each $Z(i) := (Z_0(i),\dots,Z_n(i))$ is an independent sample from the distribution
$\rho'$, conditionally independent given $Y_1,\dots,Y_n$ (see Remark 2.5). Clearly, if we choose
$q_1,\dots,q_n$ as

$$q_k(x,x')\,\psi(dx') := \mathbf{P}(X_k \in dx' \mid X_{k-1} = x) = p(x,x')\,\psi(dx'),$$

then we recover the SIS algorithm introduced in the main text. Another popular choice
in the literature is given by

$$q_k^\star(x,x')\,\psi(dx') := \mathbf{P}(X_k \in dx' \mid X_{k-1} = x, Y_{1:k}) = \frac{p(x,x')\, g(x',Y_k)}{\int p(x,x'')\, g(x'',Y_k)\,\psi(dx'')}\,\psi(dx'),$$

which yields the following weight recursion

$$W_k^\star(i) \propto W_{k-1}^\star(i) \int p(Z_{k-1}(i),x')\, g(x',Y_k)\,\psi(dx'), \qquad W_0^\star(i) = 1/N.$$

The distribution $\rho^\star$ obtained with this choice is the so-called optimal distribution. In
this context the adjective "optimal" refers to the fact that the conditional variance of
the weights at each time step (given all the samples already generated by the algorithm)
is zero, namely,

$$\mathrm{Var}\big(W_n^\star(i) \mid Z_k(j),\ k \in \{0,\dots,n-1\},\ j \in \{1,\dots,N\}\big) = 0,$$

as $W_n^\star(i)$ does not depend on $Z_n(i)$, $i \in \{1,\dots,N\}$.
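
The general weight recursion of Remark 3.6 can be sketched as a single update step. The function `sis_proposal_step` and its arguments are hypothetical names introduced here; in particular, letting the proposal density `q` and the sampler `sample_q` take the current observation as an argument is an assumption made so that both the choice $q_k = p$ and the optimal choice $q_k^\star$ fit the same interface.

```python
def sis_proposal_step(particles, weights, y, p, g, q, sample_q):
    """One step of SIS with a general proposal (weight recursion of Remark 3.6).

    Hypothetical interface (not part of the text):
      p(x, z)         -> transition density of the signal
      g(z, y)         -> observation density
      q(x, z, y)      -> proposal density used to move from x to z
      sample_q(x, y)  -> one draw from q(x, ., y) d(psi)
    Returns the new particles Z_n(i) and the normalized weights W_n(i).
    """
    new_particles = [sample_q(x, y) for x in particles]
    unnormalized = [
        w * p(x, z) * g(z, y) / q(x, z, y)
        for w, x, z in zip(weights, particles, new_particles)
    ]
    total = sum(unnormalized)
    return new_particles, [u / total for u in unnormalized]
```

With `q` equal to `p` the update reduces to the recursion (3.2). With the optimal choice the incremental weight no longer depends on the newly sampled state; for example, if $p$ is constant on a finite state space, all incremental weights coincide and equally weighted particles stay equally weighted.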

3.3  Sequential  importance  resampling 

One of the key properties of the filter distribution is that it can be computed recursively:
in order to compute $\pi_n^\mu$ we only need to know $\pi_{n-1}^\mu$ and $Y_n$ (Lemma 3.3). Despite
the fact that the SIS algorithm has an iterative implementation (Figure 3.2), the way
we derived this algorithm does not capture the recursive structure of the filter, as
the importance sampling paradigm was applied to the entire smoothing distribution
$\mathbf{P}^\mu(X_{0:n} \in \cdot \mid Y_{1:n})$, for a fixed time $n$.

It seems natural to seek a Monte Carlo approximation that can match the
recursive nature of the filter. The most popular algorithm of this type is the sequential
importance resampling (SIR) particle filter (also known as bootstrap particle filter)
introduced in 1993 by Gordon, Salmond and Smith [28], which simply inserts
a sampling step in the filter recursion. To define this algorithm, let us rewrite the
Bayes recursion as follows:

$$\pi_0^\mu = \mu, \qquad \pi_n^\mu = \mathsf{F}_n \pi_{n-1}^\mu \quad (n \ge 1),$$

where

$$(\mathsf{F}_n\rho) f := \frac{\int \rho(dx)\, p(x,x')\,\psi(dx')\, g(x',Y_n)\, f(x')}{\int \rho(dx)\, p(x,x')\,\psi(dx')\, g(x',Y_n)}.$$

It is instructive to write the recursion $\mathsf{F}_n := \mathsf{C}_n\mathsf{P}$ in two steps:

$$\pi_{n-1}^\mu \xrightarrow{\ \text{prediction}\ } \mathsf{P}\pi_{n-1}^\mu \xrightarrow{\ \text{correction}\ } \pi_n^\mu = \mathsf{C}_n\mathsf{P}\pi_{n-1}^\mu,$$

where

$$(\mathsf{P}\rho) f := \int \rho(dx)\, p(x,x')\,\psi(dx')\, f(x'), \qquad (\mathsf{C}_n\rho) f := \frac{\int \rho(dx)\, g(x,Y_n)\, f(x)}{\int \rho(dx)\, g(x,Y_n)}.$$

In the prediction step, the filter $\pi_{n-1}^\mu$ is propagated forward using the dynamics of
the underlying unobserved process $(X_n)_{n\ge 0}$ to compute the predictive distribution
$\mathbf{P}^\mu(X_n \in \cdot \mid Y_1,\dots,Y_{n-1})$. Then, in the correction step the predictive distribution is
conditioned on the new observation $Y_n$ to obtain the filter $\pi_n^\mu$.

The SIR algorithm approximates $\pi_n^\mu$ by the empirical distribution $\hat\pi_n^\mu$ computed
by the recursion

$$\hat\pi_0^\mu := \mu, \qquad \hat\pi_n^\mu := \hat{\mathsf{F}}_n \hat\pi_{n-1}^\mu \quad (n \ge 1),$$

where $\hat{\mathsf{F}}_n := \mathsf{C}_n\mathsf{S}^N\mathsf{P}$ consists of three steps

$$\hat\pi_{n-1}^\mu \xrightarrow{\ \text{prediction}\ } \mathsf{P}\hat\pi_{n-1}^\mu \xrightarrow{\ \text{sampling}\ } \mathsf{S}^N\mathsf{P}\hat\pi_{n-1}^\mu \xrightarrow{\ \text{correction}\ } \hat\pi_n^\mu := \mathsf{C}_n\mathsf{S}^N\mathsf{P}\hat\pi_{n-1}^\mu.$$

Here $N \ge 1$ is the number of particles used in the algorithm, and $\mathsf{S}^N$ is the sampling
operator defined in Definition 2.16.¹

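It is straightforward to check that if $Z \sim \rho$ and $Z' \sim P(Z,\cdot)$, then $Z' \sim \mathsf{P}\rho$. So,
at each time step $n \ge 1$, in order to draw $N$ independent samples from $\mathsf{P}\hat\pi_{n-1}^\mu$ the
SIR algorithm draws $N$ independent samples from $\hat\pi_{n-1}^\mu$, namely,

$$Z_{n-1}(i) \sim \hat\pi_{n-1}^\mu, \qquad i \in \{1,\dots,N\},$$

and then samples

$$X_n(i) \sim P(Z_{n-1}(i),\cdot), \qquad i \in \{1,\dots,N\}.$$

Then

$$\mathsf{S}^N\mathsf{P}\hat\pi_{n-1}^\mu = \frac{1}{N}\sum_{i=1}^N \delta_{X_n(i)},$$

and by applying $\mathsf{C}_n$ we finally get

$$\hat\pi_n^\mu := \sum_{i=1}^N W_n(i)\,\delta_{X_n(i)},$$

where

$$W_n(i) := \frac{g(X_n(i),Y_n)}{\sum_{\ell=1}^N g(X_n(\ell),Y_n)}, \qquad i \in \{1,\dots,N\}. \tag{3.4}$$
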


Instead  of  repeatedly  updating  the  weights  as  in  the  SIS  algorithm,  cf.  (3.2),  the  SIR 
algorithm  resets  all  the  weights  to  1/N  at  each  iteration,  before  updating  them  in 
the  correction  step  using  the  likelihood  of  the  new  observation.  The  implementation 
of  the  algorithm  is  described  in  Figure  3.4. 

The process of sampling from the distribution $\hat\pi_{n-1}^\mu$ is usually referred to as the
resampling step, as $N$ particles are sampled from an empirical measure that is itself
defined via $N$ particles, specifically,

$$\hat\pi_{n-1}^\mu = \sum_{i=1}^N W_{n-1}(i)\,\delta_{X_{n-1}(i)}, \qquad X_{n-1}(1),\dots,X_{n-1}(N) \ \text{are i.i.d.} \sim \mathsf{P}\hat\pi_{n-2}^\mu.$$

¹In the SIR algorithm the sampling operator is applied iteratively in time. At each iteration
of the algorithm, samples are drawn conditionally independently given the collection of all random
variables generated by the algorithm up to that iteration.

Algorithm 2: SIR particle filter / Bootstrap particle filter
Data: Fix $n, N \ge 1$. Let the observations $Y_1,\dots,Y_n$ be given.

Let $\hat\pi_0^\mu = \mu$;

for $k = 1,\dots,n$ do

    Sample i.i.d. $Z_{k-1}(i)$, $i = 1,\dots,N$, from the distribution $\hat\pi_{k-1}^\mu$;

    Sample $X_k(i) \sim p(Z_{k-1}(i),\cdot)\,d\psi$, $i = 1,\dots,N$;

    Compute $W_k(i) = g(X_k(i),Y_k) \,/\, \sum_{\ell=1}^N g(X_k(\ell),Y_k)$, $i = 1,\dots,N$;

    Let $\hat\pi_k^\mu = \sum_{i=1}^N W_k(i)\,\delta_{X_k(i)}$;

Compute the approximate filter $\hat\pi_n^\mu \approx \pi_n^\mu$.

Figure  3.4:  The  classical  sequential  importance  resampling  (SIR)  particle  filter. 
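
Algorithm 2 can be sketched in Python along the same lines as the SIS filter. Again this is only an illustration under stated assumptions: `sir_filter` and the callables `sample_initial`, `sample_transition` and `likelihood` are hypothetical names standing in for sampling from $\mu$, sampling from $p(x,\cdot)\,d\psi$, and evaluating $g$. The resampling step is implemented here with `random.choices`, which draws with replacement according to the given weights.

```python
import random


def sir_filter(observations, n_particles, sample_initial, sample_transition,
               likelihood, rng=random):
    """Sketch of the bootstrap (SIR) particle filter (Algorithm 2).

    Expects at least one observation. Returns the particles X_n(i) and the
    normalized weights W_n(i) defining the approximate filter at time n.
    """
    particles, weights = None, None
    for y in observations:                                  # k = 1, ..., n
        if particles is None:
            # first iteration: Z_0(i) i.i.d. from the initial distribution mu
            resampled = [sample_initial() for _ in range(n_particles)]
        else:
            # resampling: Z_{k-1}(i) i.i.d. from the weighted empirical measure
            resampled = rng.choices(particles, weights=weights, k=n_particles)
        # prediction: propagate the resampled particles with the signal dynamics
        particles = [sample_transition(z) for z in resampled]
        # correction: weights proportional to the likelihood alone, cf. (3.4)
        lik = [likelihood(x, y) for x in particles]
        total = sum(lik)
        weights = [l / total for l in lik]
    return particles, weights
```

In contrast with the SIS sketch, the weights of the previous iteration enter only through the resampling step, so they are not multiplied over time.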


Figure 3.5: Representation of a single iteration of the SIR particle filter with
$\mathbb{X} = \mathbb{Y} = \mathbb{R}_+$ and $N = 6$. Each particle is represented by a blue ball, whose size
is proportional to the weight of the particle. (a) Representation of $\hat\pi_{n-1}^\mu$. (b)
Resampling step: $N$ particles are sampled independently with replacement and
weights are reset to $1/N$. If a number $m$ is attached to a particle, then there are
$m$ particles sharing the same location. (c) Particles are propagated forward using
the underlying dynamics. (d) Particles are reweighted according to the likelihood of
the new observation at time $n$ (whose level sets are drawn in orange) yielding $\hat\pi_n^\mu$,
following the weight recursion (3.4).


In the resampling step particles with low weights are less likely to be sampled
than particles with high weights. So, in the resampling step some of the particles
with low weights will disappear, while particles with large weights will be sampled
more than once. Figure 3.5 illustrates a typical iteration of the algorithm.

The resampling step is the basic mechanism that allows the SIR algorithm to
overcome the weight degeneracy problem of the SIS algorithm with time (Section
3.2.1). In the next section we make this intuition precise by providing a detailed error
analysis for the SIR particle filter.

3.3.1 Filter stability and time-uniform error bounds

While the convergence analysis for the SIS particle filter is straightforward as the
algorithm is defined in terms of a collection of independent particles, for the SIR
algorithm the situation is more involved as at each iteration the resampling step
introduces dependency among particles (for example, recall that particles with high
weights are likely to be duplicated). Nonetheless, it is easily shown that for each
$n \ge 1$ the particle filter $\hat\pi_n^\mu$ converges to the exact filter $\pi_n^\mu$ as $N \to \infty$. To gain some
insight into the approximation properties of the SIR particle filter, let us perform
the simplest possible error analysis. Recall from Section 2.3 the following distance
between (possibly random) probability measures $\rho$, $\rho'$ on $\mathbb{X}$:

$$|||\rho - \rho'||| := \sup_{|f|\le 1} \sqrt{\mathbf{E}\,(\rho f - \rho' f)^2}.$$

From Lemma 2.10 and Lemma 2.17 we have

$$|||\mathsf{P}\rho - \mathsf{P}\rho'||| \le |||\rho - \rho'|||, \qquad |||\rho - \mathsf{S}^N\rho||| \le \frac{1}{\sqrt{N}}.$$

Let us assume for simplicity that the observation density $g$ is bounded away from
zero and infinity, that is, $\kappa \le g(x,y) \le \kappa^{-1}$ for some $0 < \kappa < 1$. From Lemma 2.9
(choosing $g(x) := g(x,Y_n)$) we obtain

$$|||\mathsf{C}_n\rho - \mathsf{C}_n\rho'||| \le 2\kappa^{-2}\,|||\rho - \rho'|||.$$

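Putting these bounds together and using the triangle inequality for the metric $|||\cdot|||$
we find

$$|||\mathsf{F}_n\rho - \hat{\mathsf{F}}_n\rho'||| = |||\mathsf{C}_n\mathsf{P}\rho - \mathsf{C}_n\mathsf{S}^N\mathsf{P}\rho'||| \le 2\kappa^{-2}\Big\{|||\mathsf{P}\rho - \mathsf{P}\rho'||| + |||\mathsf{P}\rho' - \mathsf{S}^N\mathsf{P}\rho'|||\Big\} \le 2\kappa^{-2}\Big\{|||\rho - \rho'||| + \frac{1}{\sqrt{N}}\Big\}.$$

By iterating this inequality $n$ times, using that $\pi_0^\mu = \hat\pi_0^\mu$, we find

$$|||\pi_n^\mu - \hat\pi_n^\mu||| \le 2\kappa^{-2}\Big\{|||\pi_{n-1}^\mu - \hat\pi_{n-1}^\mu||| + \frac{1}{\sqrt{N}}\Big\} \le \frac{C_n}{\sqrt{N}},$$

with

$$C_n := \sum_{i=1}^n (2\kappa^{-2})^i.$$

So, for a fixed time $n \ge 1$ the bootstrap particle filter does indeed approximate the
exact nonlinear filter as the number of particles $N$ goes to infinity, with the typical
Monte Carlo $1/\sqrt{N}$-rate.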

In many applications, however, one needs to have good estimates for the filter at
arbitrary times. This is the case, for instance, in target tracking, where the goal is to
continuously track the location of the target. The analysis that we have performed so
far does not guarantee that the SIR particle filter can be successfully applied to this
end, as $C_n$ grows exponentially in time $n$. Fortunately, the exponential growth of the
error is an artifact of our crude bound and typically does not occur in practice. The
reason why the constant $C_n$ obtained above grows with time is that we have performed
a recursive error analysis of the algorithm: we bounded the error committed
at each time step, and we naively iterated this bound for $n$ steps, so that the error
accumulates over time.

We presently show that a more refined analysis that exploits the behavior of the
filter distribution itself (instead of working at the level of the filter recursion) yields
the following time-uniform error bound:

$$\sup_{n\ge 0} |||\pi_n^\mu - \hat\pi_n^\mu||| \le \frac{C}{\sqrt{N}},$$

where $C$ is a constant that does not depend on time. This is the reason why the SIR
algorithm has proved to perform extraordinarily well in many classical applications
such as target tracking, speech recognition, and finance [8].

The property of the filter that allows this analysis is the so-called filter stability
property, which roughly says that $\pi_n^\mu$ forgets its initial condition $\mu$ as $n \to \infty$. As first
realized by Del Moral and Guionnet in 2001 [15], the stability property provides a
dissipation mechanism that mitigates the accumulation of approximation errors over
time, yielding time-uniform error bounds. In the remainder of this section we make
this idea precise under certain (strong) conditions.

Recall that both the filter and the SIR particle filter are defined recursively:

$$\pi_n^\mu := \mathsf{F}_n\cdots\mathsf{F}_1\mu, \qquad \hat\pi_n^\mu := \hat{\mathsf{F}}_n\cdots\hat{\mathsf{F}}_1\mu, \qquad n \ge 1,$$

where $\mathsf{F}_n := \mathsf{C}_n\mathsf{P}$, $\hat{\mathsf{F}}_n := \mathsf{C}_n\mathsf{S}^N\mathsf{P}$, and $\pi_0^\mu = \hat\pi_0^\mu = \mu$. The basic idea that allows one to
prove time-uniform bounds for the bootstrap particle filter is based on the following
simple error decomposition [8]. If we write $\pi_n^\mu - \hat\pi_n^\mu$ as a telescoping sum:

$$\pi_n^\mu - \hat\pi_n^\mu = \sum_{s=1}^n \Big\{\mathsf{F}_n\cdots\mathsf{F}_{s+1}\mathsf{F}_s\hat{\mathsf{F}}_{s-1}\cdots\hat{\mathsf{F}}_1\mu - \mathsf{F}_n\cdots\mathsf{F}_{s+1}\hat{\mathsf{F}}_s\hat{\mathsf{F}}_{s-1}\cdots\hat{\mathsf{F}}_1\mu\Big\},$$

then by the triangle inequality we get

$$|||\pi_n^\mu - \hat\pi_n^\mu||| \le \sum_{s=1}^n |||\mathsf{F}_n\cdots\mathsf{F}_{s+1}\mathsf{F}_s\hat\pi_{s-1}^\mu - \mathsf{F}_n\cdots\mathsf{F}_{s+1}\hat{\mathsf{F}}_s\hat\pi_{s-1}^\mu|||. \tag{3.5}$$

The $s$-th term in this sum can be interpreted as the contribution to the total error
at time $n$ due to the filter approximation made at time $s$. The key insight is now that
one can employ the filter stability property to control this sum uniformly in time.

The following theorem establishes filter stability in its simplest form, under a
certain ergodicity assumption on the signal process called mixing condition. As shown
in Lemma 2.10, this condition causes the signal $(X_n)_{n\ge 0}$ itself to forget its initial
condition at an exponential rate, and the following result shows how the filter inherits
this property.

Theorem 3.7 (Filter stability, inheritance). Suppose that the transition density $p$
satisfies the following mixing condition: there exists a constant $0 < \varepsilon < 1$ such that

$$\varepsilon \le p(x,z) \le \varepsilon^{-1} \quad \text{for all } x,z \in \mathbb{X}.$$

Then, for any two (possibly random) probability measures $\rho$ and $\rho'$ on $(\mathbb{X},\mathcal{X})$ we have,
for $n \ge s$,

$$|||\mathsf{F}_n\cdots\mathsf{F}_{s+1}\rho - \mathsf{F}_n\cdots\mathsf{F}_{s+1}\rho'||| \le 2\,\varepsilon^{-2}(1-\varepsilon^2)^{n-s}\,|||\rho - \rho'|||.$$

Proof. For each $1 \le k \le n$ define the (random) transition kernel

$$K_{k|n}(x,A) := \mathbf{P}(X_k \in A \mid X_{k-1} = x, Y_{1:n}).$$

Proceeding as in the proof of Lemma 3.3, as $K_{k|n}(x,\cdot)$ is the marginal of the distribution
$\mathbf{P}(X_{k:n} \in \cdot \mid X_{k-1} = x, Y_{1:n})$ on the $X_k$ coordinate, it is easy to verify that

$$K_{k|n}(x,A) = \frac{\int p(x,x')\,\psi(dx')\, \beta_{k|n}(x',Y_{k+1:n})\, g(x',Y_k)\, 1_A(x')}{\int p(x,x')\,\psi(dx')\, \beta_{k|n}(x',Y_{k+1:n})\, g(x',Y_k)},$$

where $\beta_{k|n}$ can be defined through the backward recursion

$$\beta_{k|n}(x,Y_{k+1:n}) := \int p(x,x')\,\psi(dx')\, g(x',Y_{k+1})\, \beta_{k+1|n}(x',Y_{k+2:n}), \qquad \beta_{n|n} := 1.$$

By the Markov property it is easy to verify that conditionally on $Y_1,\dots,Y_n$ the random
variables $X_0,\dots,X_n$ follow the law of a Markov chain. In fact, for each $1 \le k \le n$
we have

$$\mathbf{P}(X_k \in A \mid X_{0:k-1}, Y_{1:n}) = K_{k|n}(X_{k-1},A)$$

and for any probability measure $\rho$ on $(\mathbb{X},\mathcal{X})$ and any real-valued measurable function
$f$ we have

$$(\mathsf{F}_n\cdots\mathsf{F}_1\rho) f = \int \mathbf{P}^\rho(X_{0:n} \in dx_{0:n} \mid Y_{1:n})\, f(x_n)$$
$$= \int \mathbf{P}^\rho(X_0 \in dx_0 \mid Y_{1:n}) \prod_{k=1}^n \mathbf{P}^\rho(X_k \in dx_k \mid X_{0:k-1} = x_{0:k-1}, Y_{1:n})\, f(x_n)$$
$$= \rho_{0|n} K_{1|n}\cdots K_{n|n} f,$$

where we have defined $\rho_{0|n} := \mathbf{P}^\rho(X_0 \in \cdot \mid Y_{1:n})$. By the same argument, as $\mathsf{F}_n\cdots\mathsf{F}_1$
and $\mathsf{F}_n\cdots\mathsf{F}_{s+1}$, for any $0 \le s < n$, differ only in that a different sequence of observations
($Y_1,\dots,Y_n$ versus $Y_{s+1},\dots,Y_n$) is used in the computation of these quantities,
we have

$$\mathsf{F}_n\cdots\mathsf{F}_{s+1}\rho = \rho_{s|n} K_{s+1|n}\cdots K_{n|n},$$

and it is easy to check that

$$\rho_{s|n}(A) = \frac{\int \rho(dx)\, \beta_{s|n}(x,Y_{s+1:n})\, 1_A(x)}{\int \rho(dx)\, \beta_{s|n}(x,Y_{s+1:n})}.$$

Therefore, by Lemma 2.10 and Lemma 2.9 we have

$$|||\mathsf{F}_n\cdots\mathsf{F}_{s+1}\rho - \mathsf{F}_n\cdots\mathsf{F}_{s+1}\rho'||| = |||\rho_{s|n}K_{s+1|n}\cdots K_{n|n} - \rho'_{s|n}K_{s+1|n}\cdots K_{n|n}|||$$
$$\le (1-\varepsilon^2)^{n-s}\,|||\rho_{s|n} - \rho'_{s|n}|||$$
$$\le 2\,\frac{\sup_{x\in\mathbb{X}} \beta_{s|n}(x,Y_{s+1:n})}{\inf_{x\in\mathbb{X}} \beta_{s|n}(x,Y_{s+1:n})}\,(1-\varepsilon^2)^{n-s}\,|||\rho - \rho'|||.$$

The proof is immediately concluded once we notice that by the mixing condition we
have

$$\varepsilon C \le \beta_{s|n}(x,Y_{s+1:n}) \le \varepsilon^{-1} C,$$

where

$$C := \int \psi(dx')\, g(x',Y_{s+1})\, \beta_{s+1|n}(x',Y_{s+2:n}).$$

□


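Under the mixing condition for the signal, Theorem 3.7 tells us that the filter forgets
its initial condition at a geometric rate. This also means that past approximation
errors are forgotten at an exponential rate: if we substitute the stability property in
the error decomposition (3.5), we obtain

$$|||\pi_n^\mu - \hat\pi_n^\mu||| \le \sum_{s=1}^n 2\,\varepsilon^{-2}(1-\varepsilon^2)^{n-s}\,|||\mathsf{F}_s\hat\pi_{s-1}^\mu - \hat{\mathsf{F}}_s\hat\pi_{s-1}^\mu||| \le 2\,\varepsilon^{-4}\sup_{n,\rho}|||\mathsf{F}_n\rho - \hat{\mathsf{F}}_n\rho|||,$$

where the second inequality uses the geometric sum $\sum_{s=1}^n (1-\varepsilon^2)^{n-s} \le \varepsilon^{-2}$.
Thus, if we can control the error $|||\mathsf{F}_n\rho - \hat{\mathsf{F}}_n\rho|||$ in a single time step, we obtain a
time-uniform bound of the same order. In the case of the bootstrap particle filter, if
$\kappa \le g(x,y) \le \kappa^{-1}$, we have that

$$|||\mathsf{F}_n\rho - \hat{\mathsf{F}}_n\rho||| = |||\mathsf{C}_n\mathsf{P}\rho - \mathsf{C}_n\mathsf{S}^N\mathsf{P}\rho||| \le \frac{2\kappa^{-2}}{\sqrt{N}},$$

and we obtain a time-uniform version of the crude error bound:

$$\sup_{n\ge 0}|||\pi_n^\mu - \hat\pi_n^\mu||| \le \frac{4\,\varepsilon^{-4}\kappa^{-2}}{\sqrt{N}}.$$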


Let  us  remark  at  this  point  that  the  basic  error  decomposition  discussed  above 
allows  us  to  separate  the  problem  of  obtaining  time-uniform  bounds  into  two  parts: 
the  one-step  approximation  error  and  the  stability  property.  The  development  of 
these  ingredients  constitutes  the  bulk  of  the  framework  that  is  introduced  in  Chapter 
4  to  deal  with  filtering  problems  in  high  dimension. 
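
To see how different the two analyses are quantitatively, one can compare the constant $C_n = \sum_{i=1}^n (2\kappa^{-2})^i$ of the recursive bound with the time-uniform constant $4\varepsilon^{-4}\kappa^{-2}$ obtained from filter stability. The short sketch below simply evaluates both expressions; the function names are hypothetical, and the parameter values in the usage note are arbitrary choices made here for illustration.

```python
def crude_constant(kappa, n):
    """C_n = sum_{i=1}^n (2 kappa^{-2})^i from the recursive error analysis."""
    a = 2.0 / kappa ** 2
    return sum(a ** i for i in range(1, n + 1))


def uniform_constant(eps, kappa):
    """Time-uniform constant 4 eps^{-4} kappa^{-2} from the stability analysis."""
    return 4.0 / (eps ** 4 * kappa ** 2)
```

For instance, with $\varepsilon = \kappa = 1/2$ the time-uniform constant equals 256, whereas the crude constant takes the values 8, 72 and 584 at times $n = 1, 2, 3$: the recursive bound is better for the first two steps, but it is overtaken at the third step and keeps growing geometrically afterwards.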

Remark 3.8 (Results in the literature). In [15] Del Moral and Guionnet prove several
time-uniform error bounds for the SIR algorithm, under assumptions on filter stability
that are also weaker than the one considered in Theorem 3.7. Presently, we limit
our treatment to the basic ideas that are instrumental for the framework that will be
developed in Chapter 4.

3.3.2 The curse of dimensionality

While  the  SIR  algorithm  provides  estimates  that  have  error  bounds  uniform  with 
time,  it  turns  out  that  this  algorithm  suffers  severely  from  the  curse  of  dimensionality 
with  respect  to  the  spatial  dimension  of  the  model.  It  is  far  from  obvious  at  this 
point  why  this  should  be  the  case.  Indeed,  the  state  spaces  X  and  Y  have  only  been 
assumed  to  be  Polish  (a  mild  technical  assumption  meant  only  to  ensure  the  existence 
of  regular  conditional  probabilities),  and  no  explicit  notion  of  dimension  appears  in 
the above error bound. To understand why the bound

$$\sup_{n\ge 0} |||\pi_n^\mu - \hat\pi_n^\mu||| \le \frac{C}{\sqrt{N}}$$

is typically exponential in the model dimension, we must consider a suitable class of
high-dimensional models in which the dependence on dimension can be explicitly
investigated. In the present section we consider a simple class of trivial high-dimensional
models  that  is  useless  in  any  application,  but  is  nonetheless  helpful  for  the  purpose  of 
developing  intuition  for  dimensionality  issues  in  particle  filters.  Moreover,  this  trivial 
class  of  models  represents  the  backbone  of  the  more  realistic  framework  that  will  be 
considered  in  the  next  two  chapters  (see  Section  4.1). 

In a $d$-dimensional model, $X_n$ and $Y_n$ are each described by $d$ coordinates: $X_n^i$, $Y_n^i$,
$i \in \{1,\dots,d\}$. To construct a trivial $d$-dimensional model, we simply start with a
given one-dimensional model and duplicate it $d$ times. That is, let $(\tilde X_n, \tilde Y_n)_{n\ge 0}$ be a
hidden Markov model on $\tilde{\mathbb{X}}\times\tilde{\mathbb{Y}}$ with transition density $\tilde p$ and observation density $\tilde g$
with respect to reference measures $\tilde\psi$ and $\tilde\varphi$, respectively. Then we set

$$\mathbb{X} = \tilde{\mathbb{X}}^d, \qquad \mathbb{Y} = \tilde{\mathbb{Y}}^d, \qquad \psi = \tilde\psi^{\otimes d}, \qquad \varphi = \tilde\varphi^{\otimes d},$$

and

$$p(x,z) = \prod_{i=1}^d \tilde p(x^i,z^i), \qquad g(x,y) = \prod_{i=1}^d \tilde g(x^i,y^i),$$

so that each coordinate $(X_n^i, Y_n^i)_{n\ge 0}$ is an independent copy of $(\tilde X_n, \tilde Y_n)_{n\ge 0}$. The
(trivial) dependency structure of this model is represented in Figure 3.6. Note that
we have used the term $d$-dimensional in the sense that our model has $d$ independent
degrees of freedom: each degree of freedom can itself in principle take values in a
high- or even infinite-dimensional state space $\tilde{\mathbb{X}}\times\tilde{\mathbb{Y}}$. This is, however, precisely the
notion of dimension that is relevant to the curse of dimensionality (in [4, 47] this idea
is sharpened by a notion of "effective dimension").

Figure 3.6: Dependency graph of a (trivial) high-dimensional filtering model.


In this trivial setting, it is now easily seen how the curse of dimensionality arises in
our error bound. Indeed, let us assume again for simplicity that the one-dimensional
observation density satisfies $\kappa \le \tilde g(x,y) \le \kappa^{-1}$ for some $0 < \kappa < 1$. Then
$\kappa^d \le g(x,y) \le \kappa^{-d}$, so we obtain a bound that is exponential
in the dimension $d$ even after only one time step:

$$|||\pi_1^\mu - \hat\pi_1^\mu||| \le \frac{2\kappa^{-2d}}{\sqrt{N}}.$$

An inspection of our bound clarifies the source of this exponential growth: even
though the Monte Carlo sampling itself is dimension-free ($|||\rho - \mathsf{S}^N\rho||| \le N^{-1/2}$
independent of dimension, see Lemma 2.17), the correction operator $\mathsf{C}_n$, which is highly
nonlinear, blows up the sampling error exponentially in high dimension. In particular,
it is evidently the dimension of the observations, rather than that of the underlying
model, that controls the exponential growth in our error bound.
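
The practical meaning of the one-step bound $2\kappa^{-2d}/\sqrt{N}$ is easiest to see by asking how many particles it requires for a prescribed accuracy. The sketch below evaluates the bound and inverts it; the function names and the tolerance parameter `delta` are hypothetical, and the resulting numbers describe only what this upper bound guarantees, not the actual error of the algorithm.

```python
import math


def one_step_bound(kappa, d, n_particles):
    """Upper bound 2 kappa^{-2d} / sqrt(N) on the one-step error."""
    return 2.0 * kappa ** (-2 * d) / math.sqrt(n_particles)


def particles_for_bound(kappa, d, delta):
    """Smallest N for which the one-step bound is at most delta."""
    return math.ceil((2.0 * kappa ** (-2 * d) / delta) ** 2)
```

With $\kappa = 1/2$ and `delta = 1`, the required number of particles is 64 for $d = 1$, 1024 for $d = 2$ and 16384 for $d = 3$: each additional coordinate multiplies it by $\kappa^{-4} = 16$.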

Of  course,  the  above  analysis  is  far  from  convincing.  First  of  all,  we  have  only 
proved  a  rather  crude  upper  bound  on  the  approximation  error,  so  that  it  might  be 
possible  that  a  more  sophisticated  bound  would  eliminate  the  exponential  depen¬ 
dence  on  dimension  as  was  done  using  the  filter  stability  property  to  eliminate  the 
exponential  dependence  on  time.  Second,  one  could  argue  that  our  strong  notion 
of  approximation  with  respect  to  the  |||  -  |||-norm  is  too  restrictive  to  give  meaningful 
results  in  high  dimension  (which  is  in  fact  the  case:  we  will  later  consider  local  error 
bounds  instead),  so  that  a  weaker  notion  of  approximation  might  avoid  the  expo¬ 
nential  dependence  on  dimension.  Unfortunately,  the  much  more  delicate  analysis  of 
Bickel  et  al.  [4,  47]  demonstrates  conclusively  that  the  curse  of  dimensionality  of  the 
bootstrap  particle  filter  is  a  genuine  phenomenon  and  not  a  mathematical  deficiency 
of  our  analysis,  as  we  will  briefly  explain  presently.  Nonetheless,  both  the  ideas  raised 
above  to  eliminate  the  exponential  dependence  on  dimension  will  play  an  important 
role in the framework developed in Chapter 4.

3.3.3 Sample degeneracy with dimension

The  reason  why  the  SIR  algorithm  performs  poorly  when  the  model  dimension  is 
high  is  essentially  the  same  reason  why  the  SIS  algorithm  behaves  badly  when  the 
time-horizon  is  large,  and  it  has  to  do  with  the  fact  that  the  importance  sampling 
paradigm is typically very inefficient in high-dimensional models. As the SIS algorithm
approximates the smoothing distribution $\mathbf{P}^\mu(X_{0:n} \in \cdot \mid Y_{1:n})$, the dimension of
interest in that case is time: weight degeneracy occurs as $n$ increases.² On the other
hand,  in  the  current  analysis  of  the  SIR  algorithm  in  the  trivial  model  at  hand,  the 
dimension  of  interest  is  the  number  of  hidden  Markov  chains  in  the  model:  weight 
degeneracy  occurs  as  d  increases,  and  it  is  manifested  even  in  a  single  iteration  of  the 
algorithm,  as  the  following  two  examples  illustrate. 

The first example is the analog of Example 3.4 for the SIR algorithm.

Example 3.9 (Weight degeneracy of SIR with dimension). In the framework introduced
in Section 3.3.2, consider the hidden Markov model where $(X_n)_{n\ge 0}$ is a symmetric
random walk in $\mathbb{Z}^d$, $d \ge 1$, with $X_0 = x \in \mathbb{Z}^d$, and for each $n \ge 1$ we have
$Y_n = X_n + \varepsilon\eta_n$, where $\varepsilon \in \mathbb{R}_+$ and $(\eta_n)_{n\ge 0}$ is a collection of i.i.d. random variables
having the standard Gaussian distribution in $\mathbb{R}^d$ (that is, zero mean and identity
covariance matrix). We now look at the first iteration of the SIR algorithm. If the
signal-to-noise ratio is high, that is, if $\varepsilon$ is very close to 0, then we expect the distribution
$\mathbf{P}^x(X_1 \in \cdot \mid Y_1 = y_1)$ to be very concentrated around $X_1$, the true location of the
signal at time 1. However, if we sample $N$ particles from the distribution $\mathbf{P}^x(X_1 \in \cdot)$,
then on average only $N/2^d$ particles will be close to $X_1$, and the weight degeneracy gets
exponentially worse as the dimension $d$ increases. Figure 3.7 represents this scenario.
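The count in Example 3.9 can be checked numerically. The following sketch is illustrative only and rests on assumptions that the text does not fix: each coordinate of the walk is taken to move by +1 or -1 with probability 1/2, independently of the others (so that $X_1$ has $2^d$ equally likely values), and "close to $X_1$" is read as "equal to $X_1$". The function name `particles_at_truth` is hypothetical.

```python
import random


def particles_at_truth(d, n_particles, rng):
    """Count prior particles that land exactly on the true state X_1.

    Assumption: every coordinate of the walk started at 0 moves by +1 or -1
    with probability 1/2, independently, so X_1 has 2**d equally likely values.
    """
    truth = [rng.choice((-1, 1)) for _ in range(d)]
    hits = 0
    for _ in range(n_particles):
        particle = [rng.choice((-1, 1)) for _ in range(d)]
        if particle == truth:
            hits += 1
    return hits


if __name__ == "__main__":
    rng = random.Random(0)
    n_particles = 1000
    for d in (1, 2, 5, 10):
        hits = particles_at_truth(d, n_particles, rng)
        expected = n_particles / 2 ** d
        print(f"d={d:2d}  hits={hits:4d}  N/2^d={expected:.1f}")
```

With a fixed budget of particles, the number of hits should track $N/2^d$ up to sampling noise, so that already for moderate $d$ essentially no particle sits where the likelihood is concentrated.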

The following asymptotic analysis (in the limit $N \to \infty$) gives another quick
illustration of the degeneracy in dimension of the SIR algorithm. This example is the
analog of Example 3.5 in space.

Example 3.10 (Exponential growth of the SIR asymptotic variance with dimension).
Consider the (trivial) d-dimensional model introduced in Section 3.3.2. Let $\mu$ be a
probability measure on $\mathbb{X}$, and define $\boldsymbol{\mu} = \mu^{\otimes d}$ on the product space $\mathbf{X}$. Let $f$ be a
measurable function on $\mathbf{X}$ such that $f(x) = f(\tilde x)$ whenever $x^\ell = \tilde x^\ell$, for a certain
$\ell \in \{1, \dots, d\}$. Then,

\[
  \sqrt{N}\,(\hat\pi_1^N f - \pi_1 f) \xrightarrow{\text{in distribution}} \mathrm{Gaussian}(0, \sigma_d^2(f))
  \quad \text{as } N \to \infty,
\]

with

\[
  \sigma_d^2(f) := c(f)\,\gamma^{d-1},
\]

where $c(f)$ and $\gamma$ are constants that do not depend on $d$, $c(f) > 0$ as long as $f$ is not
a constant, and $\gamma > 1$ as long as the observation density $g$ is different from 1.

² Note that in our analysis of the SIS algorithm we ignored the curse of dimensionality with
respect to the model dimension. This issue is exactly the same as for the SIR algorithm, as this type
of weight degeneracy already appears in one iteration of the SIR particle filter. See Section 3.2.1.

Figure 3.7: Representation of the first iteration of the SIR particle filter applied
to the hidden Markov model described in Example 3.9, with $N = 4$ particles and
$x = 0$. Pictures (a), (b) and (c) refer to the case $d = 1$, while pictures (d), (e) and
(f) refer to the case $d = 2$. Each particle is represented by a blue ball, whose size
is proportional to the weight of the particle, and orange curves represent the level
sets of the likelihood function. As symbolically represented, after the first iteration
of the algorithm only an average of $N/2^d$ particles have meaningful weights, which
is a manifestation of the curse of dimensionality.


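For each $x \in \mathbb{X}$, $y \in \mathbb{Y}$ define

\[
  \tilde g(x,y) := \frac{g(x,y)}{\int \mu(dz)\,p(z,x')\,\psi(dx')\,g(x',y)}.
\]

Then, for each $N \ge 1$ we have

\[
  N^{1/2}(\hat\pi_1^N f - \pi_1 f)
  = \frac{N^{-1/2}\sum_{i=1}^N (f(X_1(i)) - \pi_1 f)\prod_{k=1}^d \tilde g(X_1^k(i), Y_1^k)}
         {N^{-1}\sum_{i=1}^N \prod_{k=1}^d \tilde g(X_1^k(i), Y_1^k)},
  \tag{3.6}
\]

where $(X_1(i))_{i=1,\dots,N}$ is a collection of i.i.d. random variables with distribution
$(\boldsymbol{\mu}\mathbf{P})(A) = \int \prod_{k=1}^d \mu(dz^k)\,p(z^k,x^k)\,\psi(dx^k)\,1_A(x)$, conditionally independent given $Y_1$.
By independence, for each $i$ we have

\[
  \mathbf{E}\prod_{k=1}^d \tilde g(X_1^k(i), Y_1^k)
  = \prod_{k=1}^d \mathbf{E}\,\mathbf{E}\big(\tilde g(X_1^k(i), Y_1^k) \,\big|\, Y_1^k\big)
  = \prod_{k=1}^d \mathbf{E}\int \mu(dz)\,p(z,x)\,\psi(dx)\,\tilde g(x, Y_1^k) = 1,
\]

and the strong Law of Large Numbers yields, as $N \to \infty$,

\[
  N^{-1}\sum_{i=1}^N \prod_{k=1}^d \tilde g(X_1^k(i), Y_1^k)
  \xrightarrow{\text{almost surely}}
  \mathbf{E}\prod_{k=1}^d \tilde g(X_1^k(i), Y_1^k) = 1
\]

for the denominator in (3.6). On the other hand, as

\[
  \pi_1 f = \int \prod_{k=1}^d \mu(dz^k)\,p(z^k,x^k)\,\psi(dx^k)\,\tilde g(x^k, Y_1^k)\,f(x),
\]

by independence it is immediately verified that

\[
  \mathbf{E}\,(f(X_1(i)) - \pi_1 f)\prod_{k=1}^d \tilde g(X_1^k(i), Y_1^k) = 0
\]

and

\[
  \mathbf{E}\Big((f(X_1(i)) - \pi_1 f)\prod_{k=1}^d \tilde g(X_1^k(i), Y_1^k)\Big)^2 = c(f)\,\gamma^{d-1},
\]

where

\[
  c(f) := \mathbf{E}\int \mu(dz^\ell)\,p(z^\ell,x^\ell)\,\psi(dx^\ell)\,(f(x) - \pi_1 f)^2\,\tilde g(x^\ell, Y_1^\ell)^2,
\]

\[
  \gamma := \mathbf{E}\int \mu(dz)\,p(z,x)\,\psi(dx)\,\tilde g(x, Y_1^1)^2.
\]

The Central Limit Theorem yields that the numerator in (3.6) converges in distribution
as $N \to \infty$ to a Gaussian distribution with mean 0 and variance $\sigma_d^2(f)$. Therefore,
it is immediate that also (3.6) converges in distribution to the same Gaussian
distribution. Applying Jensen's inequality twice we immediately get that $\gamma > 1$ as long
as $g$ is different from 1.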
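To get a feeling for the size of $\gamma$, the sketch below evaluates it in a two-state toy model. Everything about this model is an assumption made for illustration and does not come from the text: a single coordinate takes values in {0, 1}, its predictive law `rho` plays the role of $\int \mu(dz)\,p(z,x)\,\psi(dx)$, the observation reports the state correctly with probability $1 - q$, and the outer expectation in the definition of $\gamma$ is read as an average over an observation drawn from the model. The names `gamma_constant` and `variance_growth` are hypothetical.

```python
def gamma_constant(rho, g):
    """gamma = E sum_x rho[x] * gtilde(x, Y)**2 for a finite one-coordinate model.

    rho[x]  : predictive probability of state x
    g[x][y] : probability of observing y when the state is x
    Assumption: Y is drawn from the model, P(Y = y) = sum_x rho[x] * g[x][y],
    and gtilde(x, y) = g[x][y] / P(Y = y) is the normalized likelihood.
    """
    n_states, n_obs = len(rho), len(g[0])
    total = 0.0
    for y in range(n_obs):
        marginal = sum(rho[x] * g[x][y] for x in range(n_states))
        if marginal == 0.0:
            continue  # observation value that never occurs
        second_moment = sum(rho[x] * (g[x][y] / marginal) ** 2
                            for x in range(n_states))
        total += marginal * second_moment
    return total


def variance_growth(gamma, d):
    """Factor gamma**(d-1) that multiplies c(f) in the asymptotic variance."""
    return gamma ** (d - 1)


if __name__ == "__main__":
    rho = [0.5, 0.5]
    for q in (0.5, 0.3, 0.1):
        g = [[1 - q, q], [q, 1 - q]]
        gam = gamma_constant(rho, g)
        print(f"q={q}: gamma={gam:.3f}, gamma^(d-1) for d=50: "
              f"{variance_growth(gam, 50):.3e}")
```

In this toy model the uninformative observation $q = 1/2$ gives $\gamma = 1$ and no growth, while any informative observation gives $\gamma > 1$, so that the asymptotic variance, and hence the number of particles needed for a fixed accuracy, grows geometrically in $d$.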

The key obstacle when the observations are high-dimensional is that the posterior
measure $C_n\rho$ is nearly singular with respect to the prior measure $\rho$ (cf. Proposition
2.14). In particular, a point that has high likelihood under $\rho$ has likelihood under $C_n\rho$
that is exponentially small in the dimension. Therefore, if we draw a fixed number
$N$ of samples from $\rho$, then with very high probability every one of these samples
will have exponentially small likelihood under $C_n\rho$ and, as is common in rare-event
scenarios, the least unlikely sample will be exponentially more likely than any of
the other samples. Thus $C_n S^N\rho$ will put almost all its mass on the sample with
the largest likelihood, which yields effectively a Monte Carlo approximation of $C_n\rho$
with sample size 1 rather than $N$. This situation is illustrated in Figure 3.8. This
weight  degeneracy  phenomenon  rules  out  any  meaningful  form  of  approximation  in 
high  dimension.  In  [4,  47],  a  careful  analysis  shows  that  the  collapse  phenomenon 
occurs  unless  the  sample  size  N  is  taken  to  be  exponential  in  the  dimension,  which 
provides  a  rigorous  statement  of  the  curse  of  dimensionality. 
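The collapse described above is easy to observe in a small experiment. The sketch below is not the analysis of [4, 47]; it is a minimal illustration under assumptions chosen here for convenience: a standard Gaussian prior in $\mathbb{R}^d$, a likelihood proportional to $\exp(-|y - x|^2/2)$, and an observation whose coordinates are all equal to a fixed value. The effective sample size $1/\sum_i w_i^2$ is a standard diagnostic that is not used in the text, and the function names are hypothetical.

```python
import math
import random


def normalized_weights(d, n_particles, rng, obs=1.0):
    """Self-normalized likelihood weights of prior samples in dimension d.

    Assumptions: prior N(0, I_d), likelihood proportional to
    exp(-|y - x|^2 / 2), and every coordinate of the observation y equal to obs.
    """
    log_w = []
    for _ in range(n_particles):
        sq_dist = sum((obs - rng.gauss(0.0, 1.0)) ** 2 for _ in range(d))
        log_w.append(-0.5 * sq_dist)
    top = max(log_w)  # subtract the maximum to avoid numerical underflow
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]


def effective_sample_size(weights):
    """1 / sum_i w_i^2: equals N for uniform weights and 1 at full collapse."""
    return 1.0 / sum(w * w for w in weights)


if __name__ == "__main__":
    rng = random.Random(0)
    for d in (1, 5, 20, 50):
        w = normalized_weights(d, 500, rng)
        print(f"d={d:2d}  largest weight={max(w):.3f}  "
              f"ESS={effective_sample_size(w):.1f}")
```

For small $d$ the weights are spread over most of the particles; as $d$ grows with $N$ fixed, the largest weight approaches 1 and the effective sample size approaches 1, which is the "sample size 1 rather than $N$" behaviour described in the text.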

Remark 3.11 (Curse of dimensionality and sample degeneracy). Sample degeneracy
is the manifestation of the curse of dimensionality phenomenon in particle filters, but
it does not coincide with it. For instance, particle degeneracy also appears in
low-dimensional models if the noise driving both the dynamics and the observation is low
[61].

Figure 3.8: Illustration of weight degeneracy with model dimension in a typical
iteration of the SIR particle filter. (a) Probability measures in low dimension. (b)
Probability measures in high dimension (low-dimensional representation). In high
dimension $\rho$ and $C_n\rho$ tend to put mass on different portions of the space. This is
the reason why already after a single iteration of the SIR particle filter only a small
fraction of samples (in fact, a fraction that is exponentially small in the dimension)
is relevant in the algorithm. Each sample $X$ from $\rho$ is represented by a blue ball
whose size is proportional to the likelihood $g(X, Y_n)$, as prescribed by the weights
definition (3.4).


Although the SIR particle filter suffers from the curse of dimensionality when
applied to the full (trivial) model of Section 3.3.2, it is obvious in this case that one
can surmount this problem in a simple fashion: as the coordinates of the
high-dimensional model are independent, one can simply run an independent SIR filter
in each coordinate. It is evident that the local error of this algorithm (that is, the
error of the marginal of the filter in each coordinate) is, by construction, independent
of the model dimension $d$. In this sense, this trivial model shows that it is indeed
possible to filter very efficiently regardless of the ambient dimension (though not with
the SIR particle filter, which fails spectacularly). Chapter 4 builds on this intuition
by considering a more general class of models and by developing a sampling strategy
that can overcome the weight degeneracy with model dimension.

Remark 3.12 (Smoothing in high dimension). If, instead of computing the filter
$\mathbf{P}(X_n \in \cdot \mid Y_1, \dots, Y_n)$, we wish to compute the full conditional path distribution
$\mathbf{P}(X_0, \dots, X_n \in \cdot \mid Y_1, \dots, Y_n)$ (known as the smoothing problem), then Markov Chain
Monte Carlo (MCMC) methods can be successfully employed in high dimension. However,
this procedure requires the entire history of observations and is not recursive,
so that it cannot be implemented on-line and is impractical over a long time horizon
(cf. [3]). The crucial question to be addressed is therefore whether it is possible to
develop filtering algorithms that are both recursive and that admit error bounds that
are uniform in time and in the model dimension.

Remark 3.13 (Importance sampling). As in the case of the SIS algorithm (cf.
Remark 3.6), the SIR algorithm can also be described as an instance of the self-normalized
importance sampling paradigm introduced in Section 2.5, and different
importance  distributions  can  be  considered.  While  the  practical  performance  of  the 
SIR  algorithm  can  be  largely  improved  by  working  with  importance  distributions  that 
are  tailored  to  the  specific  model  being  investigated,  the  benefit  is  limited  to  reducing 
the  constants  sitting  in  front  of  the  error  bounds,  and  this  technique  does  not  provide 
a  fundamental  solution  to  the  curse  of  dimensionality.  A  new  paradigm  is  needed,  as 
we  will  see  in  the  next  chapter. 

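Presently we link our formulation of the SIR algorithm with the one usually considered
in the literature (see [19] for instance). First of all, notice the following identity³
which holds for each $n, N \ge 1$, and for each probability measure $\rho$ on $(\mathbb{X}, \mathcal{X})$:

\[
  C_n S^N\rho = S^N_\rho C_n\rho.
\]

In fact, by definition of $C_n$ and $S^N$ we have

\[
  C_n S^N\rho = \frac{\sum_{i=1}^N g(X(i), Y_n)\,\delta_{X(i)}}{\sum_{i=1}^N g(X(i), Y_n)},
  \qquad X(1), \dots, X(N) \text{ are i.i.d. samples} \sim \rho.
\]

On the other hand, as the Radon-Nikodym derivative between $C_n\rho$ and $\rho$ reads

\[
  \frac{d(C_n\rho)}{d\rho}(x) = \frac{g(x, Y_n)}{\int \rho(dx')\,g(x', Y_n)},
\]

from the definition of $S^N_\rho$ (Definition 2.18) we have

\[
  S^N_\rho C_n\rho
  = \frac{\sum_{i=1}^N \frac{d(C_n\rho)}{d\rho}(X(i))\,\delta_{X(i)}}{\sum_{i=1}^N \frac{d(C_n\rho)}{d\rho}(X(i))}
  = C_n S^N\rho.
\]

Therefore, the SIR algorithm introduced in the main text can be formulated as follows:

\[
  \hat\pi_{n-1}^N \xrightarrow{\text{prediction}} P\hat\pi_{n-1}^N
  \xrightarrow{\text{correction}} C_n P\hat\pi_{n-1}^N
  \xrightarrow{\text{importance sampling}} \hat\pi_n^N := S^N_{\lambda_n} C_n P\hat\pi_{n-1}^N,
\]

where the importance distribution is $\lambda_n = P\hat\pi_{n-1}^N$.

The so-called "optimal" distribution is given by the choice $\lambda_n = C_n P\hat\pi_{n-1}^N$. As
$S^N_\rho\rho = S^N\rho$, this choice yields the following algorithm

\[
  \hat\pi_{n-1}^N \xrightarrow{\text{prediction}} P\hat\pi_{n-1}^N
  \xrightarrow{\text{correction}} C_n P\hat\pi_{n-1}^N
  \xrightarrow{\text{sampling}} \hat\pi_n^N := S^N C_n P\hat\pi_{n-1}^N.
\]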


To see that this algorithm corresponds to the "optimal" SIR particle filter (cf. [19]),
note that sampling from the measure $C_n P\hat\pi_{n-1}^N$, where $\hat\pi_{n-1}^N = \frac{1}{N}\sum_{i=1}^N \delta_{X_{n-1}(i)}$, can be
implemented as follows. Define two random variables $X$ and $Z$ with joint distribution

\[
  M(Z \in dz, X \in dx)
  = \frac{\hat\pi_{n-1}^N(dz)\,p(z,x)\,\psi(dx)\,g(x, Y_n)}{\int \hat\pi_{n-1}^N(dz)\,p(z,x)\,\psi(dx)\,g(x, Y_n)},
\]

and note that $M(X \in \cdot) = C_n P\hat\pi_{n-1}^N$. To sample $X \sim M(X \in \cdot)$ we can do the
following:

1. sample $Z \sim M(Z \in dz) = \dfrac{\hat\pi_{n-1}^N(dz)\int p(z,x)\,\psi(dx)\,g(x, Y_n)}{\int \hat\pi_{n-1}^N(dz)\,p(z,x)\,\psi(dx)\,g(x, Y_n)} = \sum_{i=1}^N W_n(i)\,\delta_{X_{n-1}(i)}(dz)$,

2. sample $X \sim M(X \in dx \mid Z = z) = \dfrac{p(z,x)\,\psi(dx)\,g(x, Y_n)}{\int p(z,x')\,\psi(dx')\,g(x', Y_n)}$,

where we have defined the "optimal" weights

\[
  W_n(i) := \frac{\int p(X_{n-1}(i), x')\,\psi(dx')\,g(x', Y_n)}{\sum_{\ell=1}^N \int p(X_{n-1}(\ell), x')\,\psi(dx')\,g(x', Y_n)}.
\]

³ Here we assume that the samples $X(1), \dots, X(N)$ generated by $S^N\rho$ are the same as the samples
generated from $S^N_\rho$, which is why we speak of identity between $C_n S^N\rho$ and $S^N_\rho C_n\rho$.

Even if we were able to sample from the weighted measure $C_n P\hat\pi_{n-1}^N$ as described above,
this would still not resolve the curse of dimensionality in the filtering context. Indeed,
the error between $\pi_1 = F_1\mu$ and $\hat\pi_1^N = \hat F_1\mu$ would be dimension-free, namely,

\[
  |||\pi_1 - \hat\pi_1^N||| = |||C_1 P\mu - S^N C_1 P\mu||| \le \frac{1}{\sqrt{N}},
\]

but the error between $\pi_2 = F_2\pi_1$ and $\hat\pi_2^N = \hat F_2\hat\pi_1^N$ would again exhibit exponential
dependence on the dimension due to the sampling performed in the first time step.
The curse of dimensionality would therefore still arise due to the recursive nature of
the filtering problem (see also [46]).

Chapter 4


Block  particle  filter 


This chapter develops the main framework of local particle filters that can overcome
the curse of dimensionality. This is achieved by providing a detailed analysis of
the block particle filter that we presently introduce. Emphasis is given to the decay of
correlations property, which is seen to be the key to establishing spatially uniform error
bounds, thus representing the spatial counterpart of filter stability. The material
presented here builds on the ideas introduced at the end of the previous chapter, and it is
instrumental for the next chapter. This chapter is based on the paper [40].

4.1  Filtering  models  in  high  dimension 

In  order  to  investigate  filtering  problems  in  high  dimension  in  a  systematic  way,  we 
presently  introduce  a  class  of  high-dimensional  filtering  models  that  will  provide  the 
basic  framework  to  be  investigated  throughout  this  chapter  and  the  next  one.  In  these 
models, the state $(X_n, Y_n)$ at each time $n$ is a random field $(X_n^v, Y_n^v)_{v\in V}$ indexed by
a (finite) undirected graph $G = (V, E)$. The graph $G$ describes the spatial degrees of
freedom of the model, and the underlying dynamics and observations are local with
respect to the graph structure in a sense to be made precise below. The dimension
of the model should be interpreted as the cardinality of the vertex set $V$, which is
typically assumed to be large. Our aim is to develop quantitative results that are,
under appropriate assumptions, independent of the dimension card $V$.

We now define the hidden Markov model $(X_n, Y_n)_{n\ge 0}$ to be considered in the sequel
(we will adopt throughout the basic setting and notation introduced in Section 3.1).
The state spaces $\mathbb{X}$ and $\mathbb{Y}$ of $X_n$ and $Y_n$, and the reference measures $\psi$ and $\varphi$ of the
transition densities $p$ and $g$, respectively, are of product form

\[
  \mathbb{X} = \prod_{v\in V}\mathbb{X}^v, \qquad \mathbb{Y} = \prod_{v\in V}\mathbb{Y}^v, \qquad
  \psi = \bigotimes_{v\in V}\psi^v, \qquad \varphi = \bigotimes_{v\in V}\varphi^v,
\]

where $\psi^v$ and $\varphi^v$ are reference measures on the Polish spaces $\mathbb{X}^v$ and $\mathbb{Y}^v$, respectively.
The transition densities $p$ and $g$ are given by

\[
  p(x, z) = \prod_{v\in V} p^v(x, z^v), \qquad g(x, y) = \prod_{v\in V} g^v(x^v, y^v),
\]

where $p^v : \mathbb{X}\times\mathbb{X}^v \to \mathbb{R}_+$ and $g^v : \mathbb{X}^v\times\mathbb{Y}^v \to \mathbb{R}_+$ are transition densities with respect
to the reference measures $\psi^v$ and $\varphi^v$, respectively.

Figure 4.1: Dependency graph of a high-dimensional filtering model of the type
considered in this chapter.

The spatial graph $G$ is endowed with its natural distance $d$ (that is, $d(v,v')$ is
the length of the shortest path in $G$ between $v, v' \in V$). Let us fix throughout a
neighborhood size $r \in \mathbb{N}$, and define for each vertex $v \in V$ the r-neighborhood

\[
  N(v) := \{v' \in V : d(v,v') < r\}.
\]

We will assume that the dynamics of the underlying process $(X_n)_{n\ge 0}$ is local in the
sense that $p^v(x, z^v)$ depends on $x^{N(v)}$ only (we write $x^J = (x^j)_{j\in J}$ for $J \subseteq V$):

\[
  p^v(x, z^v) = p^v(\tilde x, z^v) \quad \text{whenever } x^{N(v)} = \tilde x^{N(v)}.
\]

That is, the conditional distribution of $X_n^v$ given $X_0, \dots, X_{n-1}$ depends on $X_{n-1}^{N(v)}$ only.
Similarly, by construction, the observations are local in that the conditional distribution
of $Y_n^v$ given $X_n$ depends on $X_n^v$ only. This dependence structure is illustrated in
Figure 4.1 (in the simplest case of a linear graph $G$ with $r = 1$).

Markov  models  of  the  form  introduced  above  appear  in  the  literature  under  various 
names,  such  as  locally  interacting  Markov  chains  or  probabilistic  cellular  automata 
[16, 35]. Such models arise naturally in numerous complex and large-scale applications,
including percolation models of disease spread or forest fires, freeway traffic
flow models, probabilistic models on networks and large-scale queueing systems, and
various biological, ecological and neural models. Moreover, local Markov processes
of this type arise naturally from finite-difference approximation of stochastic partial
differential equations, and are therefore in principle applicable to a diverse set of
data assimilation problems that arise in areas such as weather forecasting, oceanography,
and geophysics (cf. Section 4.4.4). While more general models are certainly
of substantial interest, the model defined above is prototypical of a broad range of
high-dimensional data assimilation problems and provides a basic setting for the
investigation of filtering problems in high dimension.

4.2 Decay of correlations and localization


As was explained in Section 3.3.2, the SIR particle filter is not well suited to address
high-dimensional filtering models: the approximation error generally grows exponentially
in the model dimension card $V$. However, in the trivial case when the signal
dynamics does not couple neighbors, that is, $r = 1$ (this is the analogue of the trivial
model introduced in Section 3.3.2), we know an algorithm that can overcome the
curse of dimensionality: we can simply apply the SIR particle filter independently to
each of the chains constituting the model. Clearly, in this way the error bound
pertaining to each single marginal of the model (that is, each chain) is, by construction,
independent of the model dimension.

When the signal dynamics couples neighbors ($r > 1$), however, the laws of the
model at different spatial locations are no longer independent. Nonetheless, large-scale
interacting systems can exhibit an approximate version of independence among coordinates:
this is the decay of correlations phenomenon that has been particularly well
studied in statistical mechanics (see, e.g., [27]). Informally speaking, while the states
$(X_n^v, Y_n^v)$ and $(X_n^w, Y_n^w)$ at two sites $v, w \in V$ are probably quite strongly correlated
when $v$ and $w$ are close together, one might expect that $(X_n^v, Y_n^v)$ and $(X_n^w, Y_n^w)$ are
nearly independent when $v$ and $w$ are far apart as measured with respect to the natural
distance $d$ in the graph $G$. The idea is that, due to the decay of correlations, even
in the case $r > 1$ the model can be "locally low-dimensional", in the sense that the
conditional distribution of each coordinate only needs to be updated by observations
in a neighborhood whose size is independent of the ambient dimension. Roughly
speaking, the "local dimension" of the model is the number of coordinates in a ball
whose radius is the correlation length of the filtering distribution.

As  seen  in  Section  3.3.1,  the  sampling  step  added  to  the  original  filter  recursion  is 
the  key  to  exploit  algorithmically  filter  stability  and  get  particle  filters  (i.e.,  the  SIR 
particle filter) that yield time-uniform error bounds. In this chapter we will demonstrate
that proper forms of localization of the filter recursion can be used to exploit
algorithmically  the  decay  of  correlations  property  and  to  design  local  particle  filters 
that  yield  error  bounds  that  are  uniform  both  in  time  and  in  the  model  dimension. 

A speculative back-of-the-envelope computation explains how this might work.
Due to the decay of correlations, the conditional distribution of the site $X_n^v$ given
the new observation $Y_n$ should not depend significantly on observations $Y_n^w$ at sites
$w$ distant from $v$. Suppose we can develop a local particle filtering algorithm that
at each site $v$ only uses observations in a local neighborhood $K$ of $v$ to update the
filtering distribution. As we have now restricted to observations in $K$, the sampling
error at each site will be exponential only in card $K$ rather than in the full dimension
card $V$. On the other hand, the truncation to observations in $K$ is only approximate:
the decay of correlations property suggests that the bias introduced by this truncation
should decay exponentially in diam $K$. Therefore,

\[
  \text{error} = \text{bias} + \text{variance} \approx e^{-\operatorname{diam} K} + \frac{e^{\operatorname{card} K}}{\sqrt{N}}.
\]

If the size of the neighborhoods $K$ is chosen so as to optimize the error, then the
resulting algorithm is evidently consistent (with a slower convergence rate than the
standard $1/\sqrt{N}$ Monte Carlo rate: this is likely unavoidable in high dimension) with
an error bound that is independent of the model dimension card $V$.
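To make the trade-off concrete, consider an idealized case that is assumed here only for illustration: card $K$ = diam $K$ = $k$ (as for an interval of a linear graph), all constants set to one, and $k$ treated as a continuous variable. The heuristic error can then be minimized explicitly:

```latex
\begin{align*}
  \mathrm{err}(k) &= e^{-k} + \frac{e^{k}}{\sqrt{N}}, \\
  \mathrm{err}'(k) = -e^{-k} + \frac{e^{k}}{\sqrt{N}} = 0
    \;&\Longleftrightarrow\; e^{2k} = \sqrt{N}
    \;\Longleftrightarrow\; k^\ast = \tfrac{1}{4}\log N, \\
  \mathrm{err}(k^\ast) &= N^{-1/4} + N^{1/4}\,N^{-1/2} = 2\,N^{-1/4}.
\end{align*}
```

Under these assumptions the optimized error vanishes as $N \to \infty$ at the rate $N^{-1/4}$ rather than $N^{-1/2}$, the optimal neighborhood size grows only logarithmically in $N$, and neither quantity involves card $V$.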


4.3  Block  particle  filter 


In this chapter we will investigate in detail the simplest possible local particle filtering
algorithm that can exploit decay of correlations properties of the underlying
filtering model, the block particle filter. While this algorithm possesses some inherent
limitations (see Section 4.4.3 below), it is the simplest local algorithm both mathematically
and computationally, and therefore provides an ideal starting point for the
investigation  of  particle  filters  in  high  dimension. 

To define the block particle filtering algorithm, we begin by introducing a partition
$\mathcal{K}$ of the vertex set $V$ into nonoverlapping blocks: that is, we have

\[
  V = \bigcup_{K\in\mathcal{K}} K, \qquad K \cap K' = \varnothing \ \text{ for } K \ne K', \quad K, K' \in \mathcal{K}.
\]

We now define the blocking operator

\[
  B\rho := \bigotimes_{K\in\mathcal{K}} B^K\rho,
\]

where for any measure $\rho$ on $\mathbb{X} = \prod_{v\in V}\mathbb{X}^v$ and $J \subseteq V$ we denote by $B^J\rho$ the marginal
of $\rho$ on $\prod_{v\in J}\mathbb{X}^v$. The random field described by the measure $B\rho$ on $\mathbb{X}$ is independent
across different blocks defined by the partition $\mathcal{K}$, while the marginal on each block
agrees with the original measure $\rho$. The block particle filter inserts an additional
blocking step into the SIR particle filter recursion: that is,

\[
  \hat\pi_0^\mu := \mu, \qquad \hat\pi_n^\mu := \hat F_n\hat\pi_{n-1}^\mu \quad (n \ge 1),
\]

where $\hat F_n := C_n B S^N P$ consists of four steps

\[
  \hat\pi_{n-1}^\mu \xrightarrow{\text{prediction}} P\hat\pi_{n-1}^\mu
  \xrightarrow{\text{sampling}} S^N P\hat\pi_{n-1}^\mu
  \xrightarrow{\text{blocking}} B S^N P\hat\pi_{n-1}^\mu
  \xrightarrow{\text{correction}} \hat\pi_n^\mu := C_n B S^N P\hat\pi_{n-1}^\mu.
\]

The resulting algorithm is given in Figure 4.2. Figure 4.3 illustrates a typical iteration
of the algorithm. In the special case $\mathcal{K} = \{V\}$, the block particle filter reduces to
the SIR particle filter, so that the former is a strict generalization of the latter (we
have therefore not introduced a separate notation for the SIR particle filter: in this
chapter, the notation $\hat\pi_n^\mu$ always refers to the block particle filter).

The introduction of independent blocks allows us to localize the algorithm, which
will be crucial in the high-dimensional setting. We can immediately see this fact if
we apply the block particle filter to the trivial model obtained with $r = 1$: choosing


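Algorithm 3: Block particle filter

Data: Fix $n, N \ge 1$. Let the observations $Y_1, \dots, Y_n$ be given.
Let $\hat\pi_0^\mu = \mu$;
for $k = 1, \dots, n$ do
    Sample i.i.d. $Z_{k-1}(i)$, $i = 1, \dots, N$ from the distribution $\hat\pi_{k-1}^\mu$;
    Sample $X_k^v(i) \sim p^v(Z_{k-1}(i), \cdot)\,d\psi^v$, $i = 1, \dots, N$, $v \in V$;
    Compute $W_k^K(i) = \dfrac{\prod_{v\in K} g^v(X_k^v(i), Y_k^v)}{\sum_{\ell=1}^N \prod_{v\in K} g^v(X_k^v(\ell), Y_k^v)}$, $i = 1, \dots, N$, $K \in \mathcal{K}$;
    Let $\hat\pi_k^\mu = \bigotimes_{K\in\mathcal{K}} \sum_{i=1}^N W_k^K(i)\,\delta_{X_k^K(i)}$;
Compute the approximate filter $\pi_n^\mu \approx \hat\pi_n^\mu$.

Figure 4.2: The block particle filtering algorithm considered in this chapter.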


Figure 4.3: Representation of a single iteration of the block particle filter, with
$\mathbb{X} = \mathbb{Y} = \mathbb{R}^2_+$ and $N = 6$. Each particle is represented by a ball, whose size
is proportional to the weight of the particle. (a) Representation of $\hat\pi_{n-1}^\mu$. (b)
Resampling step: $N$ particles are sampled independently with replacement and
weights are reset to $1/N$. (c) Particles are propagated forward using the underlying
dynamics. (d) Blocking step: grey balls represent the "ghost" particles that are
generated by shuffling the coordinates of the existing $N$ particles (blue balls). (e)
Particles are reweighted according to the likelihood of the new observation at time
$n$ (whose level sets are drawn in orange), yielding $\hat\pi_n^\mu$.

$\mathcal{K} = \{\{v\} : v \in V\}$ the algorithm reduces to applying the SIR particle filter independently
to each of the chains constituting the model; that is, we recover the original algorithm
that motivated our discussion in the first place.
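One pass through the loop of Algorithm 3 can be sketched in a few lines. The code below is an illustration under assumptions that are not part of the text: the graph is a path with vertices 0, ..., d-1, the neighborhood is read as $\{v' : |v - v'| < r\}$ (so that $r = 1$ is the uncoupled case), each coordinate evolves as a damped average of its neighborhood plus Gaussian noise, and observations are the state plus Gaussian noise. Such Gaussian densities do not satisfy boundedness assumptions of the kind used in error bounds; the model is only a convenient toy. All function and parameter names are hypothetical.

```python
import math
import random


def local_mean(z, v, r):
    """Average of z over {v' : |v - v'| < r} on the path graph 0, ..., d-1."""
    lo, hi = max(0, v - r + 1), min(len(z), v + r)
    return sum(z[lo:hi]) / (hi - lo)


def block_filter_step(particles, weights, blocks, y, r, rng,
                      a=0.5, sig_x=1.0, sig_y=1.0):
    """One iteration of the loop in Algorithm 3 for a toy Gaussian model.

    particles : list of N states, each a list of d floats
    weights   : weights[b][i] is the weight of particle i within block b
    blocks    : partition of range(d), given as a list of lists of vertices
    y         : the new observation, a list of d floats
    """
    n_part, d = len(particles), len(particles[0])
    # Sample from the blocked measure: each block of each new particle is
    # copied from a particle index drawn with that block's own weights.
    z = [[0.0] * d for _ in range(n_part)]
    for b, block in enumerate(blocks):
        picks = rng.choices(range(n_part), weights=weights[b], k=n_part)
        for i, j in enumerate(picks):
            for v in block:
                z[i][v] = particles[j][v]
    # Prediction: coordinate v depends on the previous state only through N(v).
    x = [[a * local_mean(zi, v, r) + rng.gauss(0.0, sig_x) for v in range(d)]
         for zi in z]
    # Correction: the weights of block K use only the observations inside K.
    new_weights = []
    for block in blocks:
        log_w = [sum(-0.5 * ((y[v] - xi[v]) / sig_y) ** 2 for v in block)
                 for xi in x]
        top = max(log_w)  # subtract the maximum to avoid underflow
        w = [math.exp(lw - top) for lw in log_w]
        total = sum(w)
        new_weights.append([wi / total for wi in w])
    return x, new_weights


def block_mean(particles, weights, blocks, v):
    """Mean of coordinate v under the blocked particle measure."""
    b = next(k for k, block in enumerate(blocks) if v in block)
    return sum(w * p[v] for w, p in zip(weights[b], particles))


if __name__ == "__main__":
    rng = random.Random(0)
    d, n_part, r = 12, 200, 2
    blocks = [list(range(s, s + 3)) for s in range(0, d, 3)]
    particles = [[0.0] * d for _ in range(n_part)]   # initial law: point mass at 0
    weights = [[1.0 / n_part] * n_part for _ in blocks]
    for step in range(5):
        y = [rng.gauss(0.0, 1.5) for _ in range(d)]  # placeholder observations
        particles, weights = block_filter_step(particles, weights, blocks, y, r, rng)
    print([round(block_mean(particles, weights, blocks, v), 2) for v in range(d)])
```

Passing a single block containing all vertices recovers the ordinary SIR update, while singleton blocks give one independent filter per coordinate; intermediate block sizes interpolate between the two.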

The  rest  of  the  chapter  is  devoted  to  showing  that  the  localization  procedure 
introduced  by  the  blocking  step  can  indeed  overcome  the  curse  of  dimensionality 
even  in  the  more  realistic  case  of  a  coupled  dynamics,  r  >  1  (proofs  are  provided  in 
Appendix  A).  Note  that  in  this  case  the  blocking  step  introduces  some  bias  in  the 
algorithm,  so  that  the  estimates  given  by  the  block  particle  filter  do  not  converge  to 
the  exact  filter  distributions  as  the  number  of  particles  N  goes  to  infinity.  However, 
the  hope  is  that  by  introducing  a  small  amount  of  bias  in  the  algorithm,  its  variance 
can  be  reduced  significantly. 

In fact, it is immediately evident from inspection of the block particle filtering
algorithm that only observations in block $K$ are used by the algorithm to update the
filtering distribution in block $K$. Therefore, following the heuristic ideas discussed in
Section 4.2, we expect that the sampling error of the algorithm is exponential in
card $K$ rather than in the model dimension card $V$. To control the bias introduced
by the blocking step, note that the blocking operation $\rho \mapsto B\rho$ decouples the distribution
$\rho$ at the boundaries of the blocks. The decay of correlations property (if it can
be established) should cause the influence of such a perturbation on the marginal
distribution at a vertex $v \in K$ to decay exponentially in the distance from $v$ to the
boundary of the block $K$. Thus the back-of-the-envelope computation in Section
4.2  applies  to  the  local  error  at  “most”  vertices,  as  the  boundaries  of  the  blocks 
only  constitute  a  small  fraction  of  the  total  number  of  vertices.  On  the  other  hand, 
the  error  will  necessarily  be  larger  for  vertices  closer  to  the  block  boundaries.  This 
spatial  inhomogeneity  of  the  local  error  is  an  inherent  limitation  of  the  block  particle 
filter  that  one  might  hope  to  alleviate  by  the  development  of  more  sophisticated  local 
particle  filters.  We  postpone  further  discussion  of  this  point  to  Section  4.4.3. 

Remark  4.1  (On  distributed  computing).  By  their  nature,  local  particle  filtering 
algorithms, such as the block particle filter considered here, are well suited to distributed
computation: as the particles are updated locally in the spatial graph, this
opens  the  possibility  of  implementing  each  local  neighborhood  on  a  separate  processor. 
While  this  was  not  the  original  intention  of  the  algorithms  we  propose,  such  properties 
could  prove  to  be  advantageous  in  their  own  right  for  the  practical  implementation  of 
filtering  algorithms  in  very  large-scale  systems. 


4.4 Main result: error bounds uniform in the dimension

Having  introduced  the  block  particle  filtering  algorithm,  we  now  proceed  to  formulate 
the  main  result  of  this  chapter  (Theorem  4.2  below). 

Recall  that  we  have  introduced  the  neighborhoods 

N(v)  :=  {V  G  V  :  d(v,v')  <  r} 
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above,  where  the  neighborhood  size  r  is  fixed  throughout  this  chapter  (in  our  model, 
the  state  of  vertex  v  depends  only  on  the  states  of  vertices  in  N(v)  in  the  previous 
time  step).  Given  a  set  J  C  V,  we  denote  the  r-inner  boundary  of  J  as 

dJ  :=  {v  G  J  :  N(v)  %  J} 

(that  is,  dJ  is  the  subset  of  vertices  in  J  that  can  interact  with  vertices  outside  J  in 
one  step  of  the  dynamics).  We  also  define  the  following  quantities: 

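\[ |\mathcal{K}|_\infty := \max_{K \in \mathcal{K}} \operatorname{card} K, \]

\[ \Delta := \max_{v \in V} \operatorname{card}\{v' \in V : d(v,v') \le r\}, \]

\[ \Delta_{\mathcal{K}} := \max_{K \in \mathcal{K}} \operatorname{card}\{K' \in \mathcal{K} : d(K,K') \le r\}, \]

where we define as usual $d(J,J') := \min_{v \in J} \min_{v' \in J'} d(v,v')$ for $J, J' \subseteq V$. Thus $|\mathcal{K}|_\infty$
is the maximal size of a block in $\mathcal{K}$, while $\Delta$ ($\Delta_{\mathcal{K}}$) is the maximal number of vertices
(blocks) that interact with a single vertex (block) in one step of the dynamics. It
should be emphasized that $r$, $\Delta$ and $\Delta_{\mathcal{K}}$ are local quantities that depend on the
geometry but not on the size of the spatial graph $G$.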
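The local quantities just defined can be computed by brute force on a small example. The following Python sketch is purely illustrative: it assumes the square lattice $\{-d,\dots,d\}^q$ with the $\ell_1$ graph distance and a partition into cubic blocks of radius $b$ (the lattice example considered later in this section), and all function names are hypothetical choices made for this illustration.

```python
from itertools import product

def lattice(d, q):
    """Vertices of the square lattice {-d, ..., d}^q."""
    return list(product(range(-d, d + 1), repeat=q))

def dist(v, w):
    """Graph distance on the lattice: l1 distance between integer vectors."""
    return sum(abs(a - b) for a, b in zip(v, w))

def neighborhood(v, V, r):
    """N(v) = {v' in V : d(v, v') <= r}."""
    return {w for w in V if dist(v, w) <= r}

def inner_boundary(J, V, r):
    """r-inner boundary of J: vertices of J whose neighborhood is not inside J."""
    J = set(J)
    return {v for v in J if not neighborhood(v, V, r) <= J}

def block_partition(d, q, b):
    """Partition of the lattice into translates of {-b, ..., b}^q.

    Assumes (2d + 1) is a multiple of (2b + 1), as in the lattice example.
    """
    side = 2 * b + 1
    assert (2 * d + 1) % side == 0
    blocks = {}
    for v in lattice(d, q):
        key = tuple((c + d) // side for c in v)
        blocks.setdefault(key, []).append(v)
    return list(blocks.values())

def set_dist(J, Jp):
    """d(J, J') = min over v in J and v' in J' of d(v, v')."""
    return min(dist(v, w) for v in J for w in Jp)

def local_quantities(d, q, b, r):
    """Return (maximal block size, Delta, Delta_K) by exhaustive search."""
    V = lattice(d, q)
    blocks = block_partition(d, q, b)
    max_block = max(len(K) for K in blocks)
    delta = max(len(neighborhood(v, V, r)) for v in V)
    delta_K = max(sum(1 for Kp in blocks if set_dist(K, Kp) <= r) for K in blocks)
    return max_block, delta, delta_K
```

For instance, `local_quantities(4, 2, 1, 1)` partitions a $9 \times 9$ lattice into nine $3 \times 3$ blocks; enlarging $d$ with $b$ and $r$ fixed leaves the three returned values unchanged, which is the sense in which these quantities are local.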

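Finally, we introduce for $J \subseteq V$ the local distance

\[ |||\rho - \rho'|||_J := \sup_{f \in \mathcal{X}^J : |f| \le 1} \sqrt{\mathbf{E}\,|\rho(f) - \rho'(f)|^2} \]

between random measures $\rho, \rho'$ on $\mathbb{X}$, where $\mathcal{X}^J$ denotes the class of measurable functions
$f : \mathbb{X} \to \mathbb{R}$ such that $f(x) = f(\tilde x)$ whenever $x^J = \tilde x^J$.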

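Theorem 4.2 (Block particle filter, main result). There exists a constant $0 < \varepsilon_0 < 1$,
depending only on the local quantities $\Delta$ and $\Delta_{\mathcal{K}}$, such that the following holds.
Suppose there exist $\varepsilon_0 < \varepsilon < 1$ and $0 < \kappa < 1$ such that

\[ \varepsilon \le p^v(x,z^v) \le \varepsilon^{-1}, \qquad \kappa \le g^v(x^v,y^v) \le \kappa^{-1} \qquad \forall\, v \in V,\ x,z \in \mathbb{X},\ y \in \mathbb{Y}. \]

Then for every $n \ge 0$, $x \in \mathbb{X}$, $K \in \mathcal{K}$ and $J \subseteq K$ we have

\[ |||\pi_n^x - \hat\pi_n^x|||_J \le \alpha \operatorname{card} J \left[ e^{-\beta_1 d(J,\partial K)} + \frac{e^{\beta_2 |\mathcal{K}|_\infty}}{\sqrt{N}} \right], \]

where the constants $0 < \alpha, \beta_1, \beta_2 < \infty$ depend only on $\varepsilon$, $\kappa$, $r$, $\Delta$, and $\Delta_{\mathcal{K}}$.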


The  key  point  of  this  result  is  that  both  the  assumptions  and  the  resulting  error 
bound  depend  only  on  local  quantities.  In  particular,  the  assumptions  and  error 
bound  depend  neither  on  time  n  nor  on  the  model  dimension  card  V. 

Remark 4.3 (On the assumptions of Theorem 4.2). A threshold requirement of the
form $\varepsilon > \varepsilon_0$ is essential in order to obtain the decay of correlations property: the decay
of correlations can fail if $\varepsilon > 0$ is too small (a phenomenon known as phase transition
in statistical mechanics). Otherwise, the assumptions of Theorem 4.2 are comparable
to assumptions commonly imposed in the literature to obtain error bounds for the SIR
particle filter [8, 15] and possess similar limitations. We postpone a discussion of
these issues to Section 4.4.1 below. Let us also note that explicit expressions for the
constants in Theorem 4.2 can be read off from the proofs; however, we do not believe
that our methods are sufficiently sharp to yield practical quantitative results.

Remark 4.4 (Dependence on observations). The particle filter $\hat\pi_n^x$ depends both on
the random samples that are drawn in the algorithm and on the random sequence of
the observations. However, the randomness of the observations plays no role in our
proofs. One can therefore interpret the expectation in the definition of $|||\cdot|||_J$ as being
taken only with respect to the random sampling mechanism in the block particle filter,
and the bound of Theorem 4.2 as holding uniformly with respect to the observation
sequence.

Remark 4.5 (Initial measure). In Theorem 4.2 we have considered $\pi_n^x$ and $\hat\pi_n^x$ with
a non-random initial condition $x \in \mathbb{X}$. This is a choice of convenience: the proof
of Theorem 4.2 yields the same conclusion for more general initial conditions that
satisfy a suitable decay of correlations property. On the other hand, the stability
property of the filter (e.g., Corollary A.5) ensures that $\pi_n^\mu$ forgets its initial condition
$\mu$ exponentially fast uniformly in the dimension, so there is little loss of generality in
choosing a computationally convenient initial condition.

To provide a concrete illustration of Theorem 4.2, we consider in the remainder
of this section the example where the spatial graph $G$ is a square lattice, that is,

\[ V = \{-d,\dots,d\}^q \qquad (d, q \in \mathbb{N}) \]

endowed with its natural edge structure. Note that in this case, the graph distance
$d(v,v')$ is simply the $\ell_1$-distance between the corresponding vectors of integers. To
define the partition $\mathcal{K}$, we cover $V$ by blocks of radius $b \in \mathbb{N}$: that is,

\[ \mathcal{K} = \{(x + \{-b,\dots,b\}^q) \cap V : x \in (2b+1)\mathbb{Z}^q\}. \]

We assume for simplicity in the sequel that $b > r$, and that $(2d+1)/(2b+1)$ is an
integer so that all $K \in \mathcal{K}$ are translates of $\{-b,\dots,b\}^q$ (this slightly simplifies our
arguments below but is not essential to our results). We can easily compute

\[ |\mathcal{K}|_\infty = (2b+1)^q, \qquad \Delta \le (2r+1)^q, \qquad \Delta_{\mathcal{K}} \le 3^q. \]

Note that these local quantities do not depend on the size $d$ of our lattice. In a data
assimilation application one might have, for example, $q = 2$, $r = 1$, $d \sim 10^3$.

Consider the block $K = \{-b,\dots,b\}^q$. Note that for $u = 0,\dots,b-r$

\[ \{v \in K : d(v,\partial K) > u\} = \{-(b-r-u),\dots,b-r-u\}^q. \]

Fix $0 < \delta < 1$ and choose $u = \lfloor \delta(2b+1)/2q - r \rfloor$. Then

\[ \frac{\operatorname{card}\{v \in K : d(v,\partial K) > u\}}{\operatorname{card} K} = \left( \frac{2(b-r-u)+1}{2b+1} \right)^q \ge 1 - \delta, \]

where we have used $1 - (1-\delta)^{1/q} \ge \delta/q$. The same conclusion evidently holds for
every block $K \in \mathcal{K}$. Thus Theorem 4.2 gives the following corollary.

Corollary 4.6. In the square lattice setting $V = \{-d,\dots,d\}^q$, there exists a constant
$0 < \varepsilon_0 < 1$, depending only on $r$ and $q$, such that the following holds.

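Suppose there exist $\varepsilon_0 < \varepsilon < 1$ and $0 < \kappa < 1$ such that

\[ \varepsilon \le p^v(x,z^v) \le \varepsilon^{-1}, \qquad \kappa \le g^v(x^v,y^v) \le \kappa^{-1} \qquad \forall\, v \in V,\ x,z \in \mathbb{X},\ y \in \mathbb{Y}. \]

Then for every $x \in \mathbb{X}$, $n \ge 0$, and $0 < \delta < 1$ we have

\[ \operatorname{card}\left\{ v \in V : |||\pi_n^x - \hat\pi_n^x|||_v \le \alpha' \left[ e^{-\beta_1' \delta (2b+1)} + \frac{e^{\beta_2' (2b+1)^q}}{\sqrt{N}} \right] \right\} \ge (1-\delta) \operatorname{card} V, \]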


where the constants $0 < \alpha', \beta_1', \beta_2' < \infty$ depend only on $\varepsilon$, $\kappa$, $r$, and $q$.

In particular, if we choose the block size $b = (4\beta_2')^{-1/q} \log^{1/q} N - \cdots$, then

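\[ \operatorname{card}\left\{ v \in V : |||\pi_n^x - \hat\pi_n^x|||_v \le c_1 e^{-c_2 \delta \log^{1/q} N} \right\} \ge (1-\delta) \operatorname{card} V \]

and

\[ \frac{1}{\operatorname{card} V} \sum_{v \in V} |||\pi_n^x - \hat\pi_n^x|||_v \le \frac{c_3}{\log^{1/q} N}, \]

where the constants $0 < c_1, c_2, c_3 < \infty$ depend only on $\varepsilon$, $\kappa$, $r$, and $q$.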
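The counting argument that precedes Corollary 4.6 can be checked by brute force. The Python sketch below is a hypothetical sanity check rather than part of the proof: it assumes a block $K = \{-b,\dots,b\}^q$ lying in the interior of the lattice, $\ell_1$ neighborhoods of radius $r \ge 1$, and the choice $u = \lfloor \delta(2b+1)/2q - r \rfloor$; the function name is introduced here for illustration only.

```python
from itertools import product
from math import floor

def interior_fraction(b, r, q, delta):
    """Fraction of vertices v in K = {-b..b}^q with d(v, boundary of K) > u.

    Assumes K is an interior block of Z^q with l1 neighborhoods of radius r >= 1,
    so that N(v) is contained in K exactly when every |v_i| <= b - r.
    Returns the pair (fraction, u) with u = floor(delta * (2b + 1) / (2q) - r).
    """
    K = list(product(range(-b, b + 1), repeat=q))
    boundary = [v for v in K if max(abs(c) for c in v) > b - r]
    u = floor(delta * (2 * b + 1) / (2 * q) - r)

    def d_to_boundary(v):
        # l1 distance from v to the nearest boundary vertex (exhaustive search)
        return min(sum(abs(a - c) for a, c in zip(v, w)) for w in boundary)

    far = sum(1 for v in K if d_to_boundary(v) > u)
    return far / len(K), u

frac, u = interior_fraction(b=10, r=1, q=2, delta=0.5)
print(u, frac, frac >= 1 - 0.5)
```

For these parameters the brute-force fraction agrees with the closed form $((2(b-r-u)+1)/(2b+1))^q$ and exceeds $1-\delta$, as the argument predicts.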

Corollary 4.6 makes precise the notion that a properly tuned block particle filter
can avoid the curse of dimensionality: choosing the block size $b \sim \log^{1/q} N$, we obtain
a local error that can be made arbitrarily small, uniformly both in time $n$ and in
the lattice size $d$, by choosing a sufficiently large sample size $N$. More precisely,
we see that the local error at most locations (i.e., on an arbitrarily large fraction of
the graph) is of order $e^{-c \log^{1/q} N}$, which is polynomial for $q = 1$ and subpolynomial
otherwise. The bound for the average local error is similarly uniform in $n$ and $d$,
albeit with a very slow convergence rate. It appears that these results are chiefly
limited by the spatial inhomogeneity that is inherent in the block particle filtering
algorithm, as will be discussed in Section 4.4.3 below.
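The trade-off between the bias and variance terms in the bound of Corollary 4.6 can be explored numerically. The sketch below is illustrative only: the constants $\alpha' = \beta_1' = \beta_2' = 1$ are hypothetical placeholders (our proofs do not yield practical values), and the function names are introduced solely for this example. It searches for the block radius $b$ that minimizes $\alpha'[e^{-\beta_1'\delta(2b+1)} + e^{\beta_2'(2b+1)^q}/\sqrt{N}]$ for a given sample size $N$.

```python
from math import exp, log

def local_error_bound(b, N, q, delta, alpha=1.0, beta1=1.0, beta2=1.0):
    """alpha * (exp(-beta1*delta*(2b+1)) + exp(beta2*(2b+1)^q) / sqrt(N)).

    The constants alpha, beta1, beta2 are hypothetical placeholders.
    """
    side = 2 * b + 1
    bias = exp(-beta1 * delta * side)
    # Work with the exponent of the variance term to avoid overflow in exp().
    exponent = beta2 * side ** q - 0.5 * log(N)
    variance = exp(exponent) if exponent < 700.0 else float("inf")
    return alpha * (bias + variance)

def best_block_radius(N, q, delta, b_max=50):
    """Block radius b in {0, ..., b_max} that minimizes the bound above."""
    return min(range(b_max + 1), key=lambda b: local_error_bound(b, N, q, delta))

for N in (10**3, 10**6, 10**12):
    b = best_block_radius(N, q=1, delta=0.5)
    print(N, b, local_error_bound(b, N, q=1, delta=0.5))
```

With these placeholder constants the minimizing radius grows only slowly with $N$, in line with the scaling $b \sim \log^{1/q} N$.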

Remark 4.7. We have stated the local error in Corollary 4.6 in terms of one-dimensional
marginals $|||\pi_n^x - \hat\pi_n^x|||_v$ for simplicity; an analogous result can be obtained
for  marginals  over  cubes  of  any  fixed  size  —  7f^|||u+r_s 

Remark 4.8. Theorem 4.2 and Corollary 4.6 should be viewed as a theoretical proof
of concept that it is possible, in principle, to design particle filters that avoid the curse
of dimensionality. In practice, the slow rate $b \sim \log^{1/q} N$ suggests that the block size
must typically be quite small (of order unity) for realistic values of the sample size $N$,
which yields a large bias term in our bounds. We have nonetheless observed in simple
simulations that the algorithm can work quite well even with the choice $b = 0$, so that
the practical utility of the algorithm may not be fully captured by our mathematical
results. Moreover, specific features of certain data assimilation applications, such as
sparsity of observations, could make it possible to choose substantially larger blocks.
A systematic investigation of the empirical performance of local particle filtering
algorithms in applications is beyond the scope of our analysis, however. The practical
implementation of local particle filters for data assimilation will likely require further
advances in all mathematical, methodological and applied aspects of high-dimensional
filtering.

In the next three sections we discuss the main aspects of Theorem 4.2.

4.4.1  Mixing  assumptions  and  the  ergodicity  threshold 

The basic assumption of Theorem 4.2 is that the local transition densities are bounded
above and below:

\[ \varepsilon \le p^v(x,z^v) \le \varepsilon^{-1}, \qquad \kappa \le g^v(x^v,y^v) \le \kappa^{-1}. \]

This is a local counterpart of the mixing assumptions that are routinely employed in
the analysis of particle filters [8, 15]. The global mixing assumption $\varepsilon \le p(x,z) \le \varepsilon^{-1}$
would imply that the underlying Markov chain is strongly ergodic (in the sense that its
transition kernel is a strict contraction with respect to the total variation distance,
cf. Lemma 2.10) and is often used to establish the stability property of the filter
(cf. Theorem 3.7). This is essential to obtain a time-uniform bound on the particle
filter error, see Section 3.3.1 and Section 4.5.1 below. The local mixing assumption
$\varepsilon \le p^v(x,z^v) \le \varepsilon^{-1}$ employed here should similarly be viewed as a local ergodicity
assumption on the model.

It is well known that strong mixing assumptions of this type impose some constraints
on the underlying model. In particular, strong mixing assumptions often
require a compact state space: in a noncompact state space the likelihood ratio
$p(x,z)/p(x',z)$ is typically unbounded as $|z| \to \infty$ (this is readily verified in linear
Gaussian models, for example), while $\varepsilon \le p(x,z) \le \varepsilon^{-1}$ would imply that
$p(x,z)/p(x',z)$ is uniformly bounded. Similarly, the assumptions of Theorem 4.2
will typically only hold in models whose local state spaces $\mathbb{X}^v$ and $\mathbb{Y}^v$ are compact.
While qualitative results in this area have been obtained in much more general settings
(cf. [52] and the references therein), it has proved to be more difficult to obtain
quantitative results under assumptions weaker than strong mixing conditions: it remains
an open problem, for example, to obtain quantitative time-uniform bounds
under mild ergodicity assumptions even for the approximation error of the SIR particle
filter. These technical issues are however unrelated to the problems that arise in
high dimension, and we do not address them here.

On the other hand, there is a crucial assumption in Theorem 4.2 that does not
arise in finite dimension. In classical results on particle filters, it is assumed that
$\varepsilon \le p(x,z) \le \varepsilon^{-1}$ with $\varepsilon > 0$. For the local assumption $\varepsilon \le p^v(x,z^v) \le \varepsilon^{-1}$, however,
it is not sufficient to assume that $\varepsilon > 0$; we must assume that $\varepsilon > \varepsilon_0$ for some strictly
positive threshold $\varepsilon_0 > 0$. Some assumption of this form is absolutely essential in
the high-dimensional setting. Unlike the global mixing assumption, the local mixing
assumption is not in itself sufficient to ensure that the underlying model will remain
ergodic as the dimension $\operatorname{card} V \to \infty$: the cumulative effect of the interactions can
create long-range correlations that break both ergodicity and any decay of correlations
properties. Typically, the model is ergodic when the mixing constant $\varepsilon$ is sufficiently
large, but ergodicity breaks abruptly as $\varepsilon$ drops below a threshold value $\varepsilon_0$. Such
phenomena, called phase transitions in statistical mechanics, are very common in
large-scale interacting systems: see [35, 16] for a number of examples. When the
underlying model fails to exhibit ergodicity and decay of correlations, we lack the
mechanism that we aim to exploit by developing local particle filters. Therefore,
some assumption of the form $\varepsilon > \varepsilon_0$ is essential in Theorem 4.2 in order to ensure the
presence of decay of correlations.

Unfortunately, the actual constant $\varepsilon_0$ that arises in the proof of Theorem 4.2 is
almost  certainly  far  from  optimal.  The  Dobrushin  machinery  (Theorem  2.11)  that 
forms  the  basis  of  our  proof  already  does  not  yield  sharp  estimates  of  the  phase 
transition  point  even  in  the  simplest  classical  models  of  statistical  mechanics.  It  is 
also  far  from  clear  whether  the  block  particle  filter  should  necessarily  possess  the 
same  phase  transition  point  as  the  underlying  model:  it  may  be  that  the  algorithm 
only  works  in  a  strict  subset  of  the  regime  in  which  the  underlying  model  possesses 
the  decay  of  correlations  property.  The  mathematical  tools  used  in  this  chapter  are 
not  sufficiently  powerful  to  address  much  more  delicate  questions  of  this  type.  The 
practical  relevance  of  Theorem  4.2  is  therefore  of  a  qualitative  nature — we  show  that 
the  block  particle  filter  can  beat  the  curse  of  dimensionality  above  a  certain  phase 
transition  point — but  should  not  be  relied  upon  to  provide  quantitative  guidance 
in  specific  situations.  It  remains  of  substantial  interest  to  weaken  the  assumptions 
of  Theorem  4.2  and  to  obtain  sharper  quantitative  results;  further  progress  in  this 
direction  will  require  the  development  of  a  more  sophisticated  probabilistic  toolbox 
for  the  investigation  of  filtering  problems  in  high  dimension. 

It  should  be  noted  that  the  problems  investigated  in  this  chapter  are  closely  related 
to  fundamental  properties  of  conditional  distributions.  We  have  implicitly  taken  for 
granted  that  the  filter  will  be  stable  when  the  underlying  model  is  ergodic  (and 
similarly  for  the  decay  of  correlations  property),  but  it  is  far  from  obvious  that  such 
properties  are  in  fact  preserved  under  conditioning  on  the  observations.  While  the 
inheritance  of  ergodic  properties  under  conditioning  can  be  proved  in  a  very  general 
setting  for  models  with  finite-dimensional  observations  (see  [52]  and  the  references 
therein),  we  will  see  in  Chapter  7  that  there  exist  surprising  examples  in  infinite 
dimension  where  the  filter  is  non-ergodic  even  though  the  underlying  model  is  ergodic 
and  nondegenerate.  Such  probabilistic  phenomena  remain  poorly  understood.  The 
threshold assumption $\varepsilon > \varepsilon_0$ rules out such issues in the setting of this chapter.

4.4.2  Ergodicity  in  space  and  time 

The intuition behind the block particle filtering algorithm is that the localization controls
the sampling error (as it replaces the model dimension $\operatorname{card} V$ by the block size
$|\mathcal{K}|_\infty$), while the decay of correlations property of the model controls the localization
error (as it ensures that the effect of the localization decreases in the distance to the
block boundary). This intuition is clearly visible in the conclusion of Theorem 4.2.
It is however not automatically the case that our model does indeed exhibit decay of
correlations: when there are strong interactions between the vertices, phase transitions
can arise and the decay of correlations can fail much as for standard models in
statistical mechanics [35], in which case we cannot expect to obtain dimension-free
performance for the block particle filter. Such phenomena are ruled out in Theorem 4.2
by the assumption that $\varepsilon \le p^v \le \varepsilon^{-1}$ for $\varepsilon > \varepsilon_0$, which ensures that the
interactions in our model are sufficiently weak.

It  is  notoriously  challenging  to  obtain  sharp  quantitative  results  for  interacting 
models,  and  it  is  unlikely  that  one  could  obtain  realistic  values  for  the  constants  in 
Theorem  4.2  at  the  level  of  generality  considered  here.  More  concerning,  however,  is 
that  the  weak  interaction  assumption  of  Theorem  4.2  is  already  unsatisfactory  at  the 
qualitative  level,  as  decay  of  correlations  in  space  and  time  are  treated  on  the  same 
footing: as $\varepsilon \to 1$, both the spatial and temporal correlations disappear. Note that
there is no interaction between the vertices in the extreme case $\varepsilon = 1$; the assumption
$\varepsilon > \varepsilon_0$ should be viewed as a perturbation of this situation (i.e., weak interactions).
However, setting $\varepsilon = 1$ turns off not only the interaction between different vertices,
but  also  the  interaction  between  the  same  vertex  at  different  times:  in  this  setting  the 
dynamics  of  the  model  become  trivial.  In  contrast,  one  would  expect  that  it  is  only 
the  strength  of  the  spatial  interactions,  and  not  the  local  dynamics,  that  is  relevant 
for  dimension-free  errors,  so  that  Theorem  4.2  places  an  unnatural  restriction  on  our 
understanding  of  block  particle  filters. 

It is therefore of interest to separate the temporal and spatial ergodicity assumptions,
for example, by replacing the assumption $\varepsilon \le p^v(x,z^v) \le \varepsilon^{-1}$ by an assumption
of the form $\varepsilon\, q^v(x^v,z^v) \le p^v(x,z^v) \le \varepsilon^{-1} q^v(x^v,z^v)$ that only controls the spatial
interactions, where the transition density $q^v$ describes the local dynamics at the vertex
$v$ in the absence of interactions. Rather than assuming $p^v(x,z^v) \approx 1$ as in Theorem
4.2, we would like to assume only that the spatial interactions are weak in the sense
that $p^v(x,z^v) \approx q^v(x^v,z^v)$.

Overcoming this deficiency of Theorem 4.2 requires the development of more
refined comparison theorems than the Dobrushin comparison theorem that is used
repeatedly for the results presented in this chapter (see Section 4.5.2 below). This
new toolbox is of independent interest, and it will be the subject of Chapter 6. The
analysis of the block particle filter on the basis of the new comparison theorems will
yield Theorem 6.13, which qualitatively improves on Theorem 4.2.


4.4.3  Local  algorithms  and  spatial  homogeneity 

The major drawback of the block particle filtering algorithm is the spatial inhomogeneity
of the bias. As was explained in Section 4.3, the block particle filter introduces
errors at the block boundaries. We will increase the size of the blocks as the number
of particles $N$ increases, so that more points are distant from the block boundaries
and therefore benefit from the decay of correlations. Nonetheless, points near the
boundary will always be subject to larger errors, and we can only hope to implement
the intuition of Section 4.2 at spatial locations that are strictly in the interior of the
blocks.

The consequences of this inhomogeneity are manifested quantitatively in Corollary
4.6. Near the block boundaries, Theorem 4.2 gives a bound of order unity. By excluding
a small fraction of spatial locations, however, we eliminate the block boundaries
to retain an error of order $e^{-c \log^{1/q} N}$ at "most" spatial locations:

\[ \operatorname{card}\left\{ v \in V : |||\pi_n^x - \hat\pi_n^x|||_v \lesssim e^{-c\delta \log^{1/q} N} \right\} \ge (1-\delta) \operatorname{card} V. \]

If, on the other hand, we compute the spatial average of the error, we obtain an
exceedingly slow convergence rate that is much worse than the "typical" rate:

\[ \frac{1}{\operatorname{card} V} \sum_{v \in V} |||\pi_n^x - \hat\pi_n^x|||_v \lesssim \frac{1}{\log^{1/q} N}. \]

Note that the block boundaries constitute a fraction $\sim 1/b$ of spatial locations, where
$b$ is the block size; therefore, as $b \sim \log^{1/q} N$ in Corollary 4.6, we see that the error
at the block boundaries dominates our bound on the average error.

The  behavior  of  the  errors  described  above  seems  to  be  an  inherent  limitation  of 
the  block  particle  filtering  algorithm.  It  is  therefore  of  significant  interest  to  explore 
the  possibility  that  one  could  develop  alternative  local  particle  filtering  algorithms 
that  are  spatially  homogeneous.  Conceptually,  as  explained  in  Section  4.2,  such 
an  algorithm  should  update  the  filtering  distribution  at  each  site  v  using  sites  in  a 
centered neighborhood $N_b(v) := \{v' \in V : d(v,v') \le b\}$; the decay of correlations
should then yield a bias that decays exponentially in $b$. In this case, we would expect
to obtain a spatially uniform error bound of the form

\[ \sup_{v \in V} |||\pi_n^x - \hat\pi_n^x|||_v \lesssim e^{-c \log^{1/q} N} \]

for the optimized neighborhood size $b \sim \log^{1/q} N$. Whether it is in fact possible to
design  a  local  particle  filtering  algorithm  that  attains  such  a  uniform  error  bound  is 
still  an  open  question.  Chapter  5  is  devoted  to  discussing  one  possible  idea  that  could 
be  of  interest  in  this  setting. 


4.4.4  High-dimensional  models  in  data  assimilation 

The  basic  model  that  we  have  introduced  in  Section  4.1  is  prototypical  of  many  data 
assimilation  problems  and  provides  a  particularly  convenient  mathematical  setting  for 
the  investigation  of  filtering  problems  in  high  dimension.  While  such  models  could  be 
directly  relevant  to  many  high-dimensional  applications,  there  remains  a  substantial 
gap  between  relatively  simple  models  of  this  form  and  realistic  models  used  in  the 
most  complex  applications,  particularly  in  the  geophysical,  atmospheric  and  ocean 
sciences,  that  frequently  consist  of  coupled  systems  of  partial  differential  equations. 
The  investigation  of  such  complex  problems,  and  the  associated  numerical,  physical, 
and  practical  issues,  is  far  beyond  the  scope  of  this  thesis.  We  therefore  restrict  our 
discussion  of  such  problems  to  a  few  brief  comments. 

In principle, discrete models as defined in Section 4.1 arise naturally as finite-difference
approximations of stochastic partial differential equations with space-time
white noise forcing. As the resulting state spaces $\mathbb{X}^v$ are not compact, such systems
cannot satisfy strong mixing assumptions (cf. Section 4.4.1), but this is likely a mathematical
rather than a practical problem. More importantly, it is not clear whether
the discretized models will be in the regime of decay of correlations (that is, above
the phase transition point) even if the original continuum model possesses such properties.
It is possible that this requirement would place constraints on the spatial and
temporal discretization steps, in the spirit of the von Neumann stability criterion in
numerical analysis. The physics of such problems could also impose constraints on
the  design  of  local  particle  filters;  for  example,  it  is  suggested  in  [60,  p.  4107]  that 
discontinuities  (such  as  might  be  introduced  at  the  block  boundaries  in  the  block 
particle  filtering  algorithm)  could  generate  spurious  gravity  waves  in  ocean  models. 
Such  numerical  and  practical  issues  are  distinct  from  the  fundamental  problems  in 
high  dimension  that  we  aim  to  address  in  this  thesis,  but  can  ultimately  play  an 
equally  important  role  in  complex  applications. 

Let us also note that models considered in the data assimilation literature are often
deterministic partial differential equations without stochastic forcing; the only randomness
in such models comes from the initial condition (cf. [34, 1]). In deterministic
chaotic dynamical systems, it is impossible to obtain time-uniform approximations
using classical particle filters as there is no dissipation mechanism for approximation
errors (the filter cannot be stable in this case, cf. Section 4.5.1). This issue is not directly
related to dimensionality issues in particle filters: such problems arise in every
deterministic filtering problem. It is natural to regularize deterministic systems by
adding dynamical noise to the model (there is an extensive literature on random perturbations
of chaotic dynamics, see for example [6]); a similar observation has been
made by practitioners in the context of ad-hoc filtering algorithms, cf. [34, section 5].
To our knowledge, a rigorous analysis of such ideas in the setting of particle filters
has yet to be performed.

4.5 Outline of the proof: framework behind local particle filters

In this section we discuss the outline of the proof of the main result of this chapter,
Theorem 4.2. While this discussion is tailored to the analysis of the block particle
filter, the ideas here developed constitute the backbone of a more general framework
that encompasses a new philosophy behind filtering in high dimension. The details
of the proof of Theorem 4.2 will then be given in Appendix A.

4.5.1  Error  decomposition 

The goal of Theorem 4.2 is to bound the error between the filter $\pi_n^\mu$ and the block
particle filter $\hat\pi_n^\mu$. Recall that both the filter (Section 3.1) and block particle filter
(Section 4.3) are defined recursively:

\[ \pi_n^\mu = \mathsf{F}_n \cdots \mathsf{F}_1 \mu, \qquad \hat\pi_n^\mu = \hat{\mathsf{F}}_n \cdots \hat{\mathsf{F}}_1 \mu, \]

where $\mathsf{F}_n := \mathsf{C}_n \mathsf{P}$ and $\hat{\mathsf{F}}_n := \mathsf{C}_n \mathsf{B} \mathsf{S}^N \mathsf{P}$. We introduce also the block filter

\[ \tilde\pi_n^\mu := \tilde{\mathsf{F}}_n \cdots \tilde{\mathsf{F}}_1 \mu, \]

with $\tilde{\mathsf{F}}_n := \mathsf{C}_n \mathsf{B} \mathsf{P}$. By the triangle inequality, we have

\[ |||\pi_n^\mu - \hat\pi_n^\mu|||_J \le |||\pi_n^\mu - \tilde\pi_n^\mu|||_J + |||\tilde\pi_n^\mu - \hat\pi_n^\mu|||_J. \]

The first term on the right-hand side quantifies the bias introduced by the projection
on independent blocks, while the second term quantifies the error due to the variance
of the random sampling in the algorithm. Each term will be bounded separately to
obtain the two terms in the error bound of Theorem 4.2.
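To make the recursion for the block particle filter concrete, the following Python sketch implements one step under stated assumptions. It is a minimal illustration, not the definitive implementation of Section 4.3: particles are represented as dictionaries mapping vertices to local states, `sample_transition` and `g` are hypothetical user-supplied callables (a sampler for the dynamics and the local observation density), and the block-wise correction is realized by resampling each block independently with weights given by the product of the local likelihoods over the block.

```python
import random

def block_particle_filter_step(particles, y, blocks, sample_transition, g, rng):
    """One step of a block particle filter (illustrative sketch).

    particles         : list of N dicts {vertex: state}
    y                 : dict {vertex: observation} at the current time
    blocks            : list of lists of vertices forming a partition of V
    sample_transition : callable (x, rng) -> new dict, one draw from the dynamics
    g                 : callable (v, x_v, y_v) -> local likelihood (positive float)
    rng               : random.Random instance
    """
    N = len(particles)
    # Prediction and sampling: propagate each particle through the dynamics.
    predicted = [sample_transition(x, rng) for x in particles]
    new_particles = [dict() for _ in range(N)]
    for K in blocks:
        # Block weight of each particle: product of local likelihoods over K.
        weights = []
        for x in predicted:
            w = 1.0
            for v in K:
                w *= g(v, x[v], y[v])
            weights.append(w)
        # Resample the coordinates in K independently of all other blocks.
        indices = rng.choices(range(N), weights=weights, k=N)
        for i, j in enumerate(indices):
            for v in K:
                new_particles[i][v] = predicted[j][v]
    return new_particles
```

Because every block is resampled separately, a returned particle may combine block configurations that originate from different predicted particles; this is the source of both the reduced variance and the bias at the block boundaries.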

The  challenges  encountered  in  bounding  the  bias  term  (cf.  Section  4.5.3)  and  the 
variance  term  (cf.  Section  4.5.4)  are  quite  different  in  nature.  Nonetheless,  both 
bounds  are  based  on  a  basic  scheme  of  proof  that  was  invented  in  order  to  prove 
time-uniform bounds for the SIR particle filter [15, 8], see Section 3.3.1. We therefore
begin  by  reviewing  this  general  idea,  which  is  based  on  a  simple  error  decomposition. 

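Suppose for sake of illustration that we aim to bound directly the error between
$\pi_n^\mu$ and $\hat\pi_n^\mu$. The basic idea is to write $\pi_n^\mu - \hat\pi_n^\mu$ as a telescoping sum:

\[ \pi_n^\mu - \hat\pi_n^\mu = \sum_{s=1}^n \left\{ \mathsf{F}_n \cdots \mathsf{F}_{s+1} \mathsf{F}_s \hat\pi_{s-1}^\mu - \mathsf{F}_n \cdots \mathsf{F}_{s+1} \hat{\mathsf{F}}_s \hat\pi_{s-1}^\mu \right\}. \]

By the triangle inequality,

\[ |||\pi_n^\mu - \hat\pi_n^\mu||| \le \sum_{s=1}^n |||\mathsf{F}_n \cdots \mathsf{F}_{s+1} \mathsf{F}_s \hat\pi_{s-1}^\mu - \mathsf{F}_n \cdots \mathsf{F}_{s+1} \hat{\mathsf{F}}_s \hat\pi_{s-1}^\mu|||. \]

The term $s$ in this sum could be interpreted as the contribution to the total error at
time $n$ due to the filter approximation made at time $s$.

The key insight is now that one can employ the filter stability property to control
this sum uniformly in time. In its simplest form, this property can be proved in the
following form (see Theorem 3.7): if $\varepsilon \le p(x,z) \le \varepsilon^{-1}$ for all $x, z \in \mathbb{X}$, then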

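\[ |||\mathsf{F}_n \cdots \mathsf{F}_{s+1} \rho - \mathsf{F}_n \cdots \mathsf{F}_{s+1} \rho'||| \le 2\varepsilon^{-2} (1-\varepsilon^2)^{n-s} |||\rho - \rho'|||. \]

Thus, the filter forgets its initial condition at an exponential rate. However, this
also means that past approximation errors are forgotten at an exponential rate: if we
substitute the stability property in the above error decomposition, we obtain

\[ |||\pi_n^\mu - \hat\pi_n^\mu||| \le 2\varepsilon^{-2} \sum_{s=1}^n (1-\varepsilon^2)^{n-s} |||\mathsf{F}_s \hat\pi_{s-1}^\mu - \hat{\mathsf{F}}_s \hat\pi_{s-1}^\mu||| \le \frac{2}{\varepsilon^4} \max_{s \le n} |||\mathsf{F}_s \hat\pi_{s-1}^\mu - \hat{\mathsf{F}}_s \hat\pi_{s-1}^\mu|||. \]

Thus, if we can control the error $|||\mathsf{F}_n \rho - \hat{\mathsf{F}}_n \rho|||$ in a single time step, we obtain a
time-uniform bound of the same order. In the case of the SIR particle filter, if
$\kappa \le g(x,y) \le \kappa^{-1}$, we proved in Section 3.1 that $|||\mathsf{F}_n \rho - \hat{\mathsf{F}}_n \rho||| \le 2\kappa^{-2}/\sqrt{N}$.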
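The mechanism by which the stability property yields a time-uniform bound is a geometric sum, which can be checked numerically. The sketch below is an illustration under the assumptions stated in the text (stability factor $2\varepsilon^{-2}(1-\varepsilon^2)^{n-s}$ and one-step SIR error $2\kappa^{-2}/\sqrt{N}$); the function names are hypothetical.

```python
def accumulated_error_factor(eps, n):
    """sum_{s=1}^{n} 2 * eps^-2 * (1 - eps^2)^(n - s).

    This is the factor multiplying the worst one-step error after n time steps;
    it is bounded by 2 / eps^4 uniformly in n.
    """
    return sum(2.0 * eps ** -2 * (1.0 - eps ** 2) ** (n - s) for s in range(1, n + 1))

def sir_time_uniform_bound(eps, kappa, N):
    """(2 / eps^4) times the one-step SIR error 2 / (kappa^2 * sqrt(N))."""
    return (2.0 / eps ** 4) * (2.0 / (kappa ** 2 * N ** 0.5))

for n in (1, 10, 100, 1000):
    print(n, accumulated_error_factor(0.5, n), 2.0 / 0.5 ** 4)
```

The accumulated factor increases with $n$ but saturates at $2/\varepsilon^4$, so the number of time steps does not enter the final bound.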




The  basic  error  decomposition  discussed  above  allows  us  to  separate  the  problem 
of  obtaining  time-uniform  bounds  into  two  parts:  the  one-step  approximation  error 
and  the  stability  property.  It  is  important  to  note,  however,  that  both  parts  become 
problematic  in  high  dimension.  We  have  already  seen  (Section  3.3.2)  that  the  one-step 
approximation  error  of  the  SIR  particle  filter  is  exponential  in  the  model  dimension; 
we  will  surmount  this  problem  by  working  with  the  block  particle  filtering  algorithm 
and  performing  a  local  analysis  of  the  one-step  error  using  the  decay  of  correlations 
property  (which  must  itself  be  established).  On  the  other  hand,  the  filter  stability 
bound used above also becomes exponentially worse in high dimension: a local bound
of the form $\varepsilon \le p^v(x,z^v) \le \varepsilon^{-1}$ only yields $\varepsilon^{\operatorname{card} V} \le p(x,z) \le \varepsilon^{-\operatorname{card} V}$, which is
exponential in the model dimension $\operatorname{card} V$. To surmount this problem, we must
develop  a  much  more  precise  understanding  of  the  filter  stability  property  in  high 
dimension,  which  proves  to  be  closely  related  to  the  decay  of  correlations  property. 
The  development  of  these  ingredients  constitutes  the  bulk  of  the  proof  of  Theorem 
4.2. 
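The deterioration of the global stability bound with the dimension is easy to quantify. In the following sketch, which is illustrative only, the global mixing constant is taken to be $\varepsilon^{\operatorname{card} V}$ as in the discussion above and is substituted into the stability factor $2\varepsilon^{-2}(1-\varepsilon^2)^{n-s}$; the function names are hypothetical.

```python
from math import log, log1p

def global_mixing_constant(eps, card_V):
    """Global lower bound eps^card(V) implied by the local bound eps."""
    return eps ** card_V

def global_stability_prefactor(eps, card_V):
    """Prefactor 2 / eps_global^2 of the global stability bound."""
    return 2.0 / global_mixing_constant(eps, card_V) ** 2

def halving_time(eps, card_V):
    """Number of steps k with (1 - eps_global^2)^k = 1/2 (as a real number)."""
    eg2 = global_mixing_constant(eps, card_V) ** 2
    return log(2.0) / -log1p(-eg2)

for card_V in (1, 10, 100):
    print(card_V, global_stability_prefactor(0.9, card_V), halving_time(0.9, card_V))
```

Even for the mild local constant $\varepsilon = 0.9$, both the prefactor and the time needed to forget past errors grow exponentially with $\operatorname{card} V$, which is why a local analysis of stability is required.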

4.5.2  Dobrushin  comparison  method 

How  can  one  control  the  approximation  error  of  high-dimensional  distributions?  The 
basic  idea  that  we  aim  to  exploit,  both  algorithmically  and  mathematically,  is  that 
the  decay  of  correlations  property  leads  to  a  form  of  localization:  the  effect  on  the 
distribution in some spatial set $J$ of a perturbation made in another set $J'$ decays
rapidly in the distance $d(J,J')$. Therefore, as long as we measure the error locally
(in $|||\cdot|||_J$ rather than $|||\cdot|||$), one would hope to control the spatial accumulation of
approximation  errors  much  as  we  controlled  the  accumulation  of  approximation  errors 
in  time  using  the  filter  stability  property. 

The  Dobrushin  comparison  theorem  (Theorem  2.11)  introduced  in  Section  2.4  is 
the tool that will allow us to characterize the crucial way in which the decay of correlations
property enters the picture. In the current setting, a useful manifestation of
the  decay  of  correlations  property  is  that  the  matrix  D  from  the  comparison  theorem 
is such that $D_{ij}$ decays exponentially in the distance $d(i,j)$. If this is in fact the case,
then Theorem 2.11 yields, for example, $\|\rho - \tilde\rho\|_i \lesssim \sum_j e^{-d(i,j)}\, b_j$, where $b_j$ measures
the local error at site $j$ between $\rho$ and $\tilde\rho$ (in terms of the conditional distributions $\rho^j_\cdot$
and $\tilde\rho^j_\cdot$). The decay of correlations property therefore controls the accumulation of
local  errors  much  as  one  might  expect. 
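
As a rough numerical illustration of why exponentially decaying weights keep the accumulated error dimension-free, the sketch below evaluates $\sum_j e^{-d(i,j)} b_j$ on a one-dimensional lattice. The lattice, the distance $d(i,j) = |i-j|$, the choice $b_j = 1$, and the function name are assumptions made only for this example; constants from Theorem 2.11 are ignored.

```python
import math

def local_error_bound(site: int, local_errors: list[float]) -> float:
    """Evaluate sum_j exp(-|site - j|) * b_j on a one-dimensional lattice."""
    return sum(math.exp(-abs(site - j)) * b for j, b in enumerate(local_errors))

# With unit local errors at every site, the weighted sum at the central site
# stays below (e + 1) / (e - 1) however many sites are added.
for size in (11, 101, 1001):
    print(size, local_error_bound(size // 2, [1.0] * size))
```

The unweighted sum $\sum_j b_j$ would instead grow linearly with the number of sites, which is the spatial accumulation of errors that the local norm is designed to avoid.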

Let  us  now  explain  how  Theorem  2.11  will  be  applied  in  the  filtering  setting.  For 
sake  of  illustration,  consider  the  problem  of  obtaining  a  local  filter  stability  bound: 
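that is, we would like to bound $\|\pi_n^x - \pi_n^{\tilde x}\|_J$ for $x, \tilde x \in \mathbb{X}$ and $J \subseteq V$. It would seem
natural to apply Theorem 2.11 directly with $I = V$, $\mathbb{S} = \mathbb{X}$, and $\rho = \pi_n^x$, $\tilde\rho = \pi_n^{\tilde x}$.
This is not useful, however, as we do not know how to control the corresponding local
quantities such as $\rho^v_z = \mathbf{P}^x(X_n^v \in \cdot \mid Y_1,\ldots,Y_n,\ X_n^{V\setminus\{v\}} = z^{V\setminus\{v\}})$.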

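Instead, define $I = \{0,\ldots,n\} \times V$ and $\mathbb{S} = \mathbb{X}^{n+1}$, and let

$$\rho = \mathbf{P}^x(X_0,\ldots,X_n \in \cdot \mid Y_1,\ldots,Y_n), \qquad \tilde\rho = \mathbf{P}^{\tilde x}(X_0,\ldots,X_n \in \cdot \mid Y_1,\ldots,Y_n).$$

As

$$\|\pi_n^x - \pi_n^{\tilde x}\|_J = \|\rho - \tilde\rho\|_{\{n\}\times J},$$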

we can now apply Theorem 2.11 to the smoothing distributions $\rho$, $\tilde\rho$. Unlike the filters
$\pi_n^x$, $\pi_n^{\tilde x}$, however, $\rho$ and $\tilde\rho$ are Markov random fields on $I$ (cf. Figure 4.1), so that the
conditional distributions $\rho^{k,v}_z$ and $\tilde\rho^{k,v}_z$ can be easily computed and controlled in terms
of the local densities $p^v(x,z^v)$ and $g^v(x^v,y^v)$. For example, as

$$\rho(A) \propto \int \mathbf{1}_A(x,x_1,\ldots,x_n) \prod_{k=1}^n \prod_{v\in V} p^v(x_{k-1},x_k^v)\, g^v(x_k^v,Y_k^v)\, \psi^v(dx_k^v),$$

and as $p^v(x_{k-1},x_k^v)$ depends only on $x_{k-1}^w$ for $d(w,v) \le r$, we obtain

$$\rho^{k,v}_x(B) \propto \int \mathbf{1}_B(x_k^v)\, p^v(x_{k-1},x_k^v)\, g^v(x_k^v,Y_k^v) \prod_{w\in N(v)} p^w(x_k,x_{k+1}^w)\, \psi^v(dx_k^v)$$

for $0 < k < n$ and $v \in V$ (the proportionality is up to a normalization factor).
We will repeatedly exploit expressions of this type to obtain explicit bounds on the
quantities $C_{ij}$ and $b_j$ that appear in Theorem 2.11. It should be emphasized that
$\rho^{k,v}_x$ is a genuinely local quantity: the product inside the integral contains at most
$\operatorname{card} N(v) \le \Delta$ terms. We will consequently be able to use Theorem 2.11 to obtain
bounds that do not depend on the model dimension $\operatorname{card} V$.

Remark  4.9.  In  the  language  of  statistical  mechanics,  we  exploit  the  fact  that  the 
smoothing distribution $\mathbf{P}^x(X_0,\ldots,X_n \in \cdot \mid Y_1,\ldots,Y_n)$ is a Gibbs measure [27] on the
space-time index set $I = \{0,\ldots,n\} \times V$. Similar insight has proved to be fruitful in
the  ergodic  theory  of  large-scale  interacting  Markov  chains,  cf.  [35]. 


4.5.3  Bounding  the  bias:  decay  of  correlations 

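To bound the bias $\|\pi_n^x - \tilde\pi_n^x\|_J$, we follow the basic error decomposition scheme described
above: that is,

$$\|\pi_n^x - \tilde\pi_n^x\|_J \le \sum_{s=1}^n \|F_n \cdots F_{s+1} F_s \tilde\pi_{s-1}^x - F_n \cdots F_{s+1} \tilde F_s \tilde\pi_{s-1}^x\|_J.$$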

To  implement  our  program,  we  must  now  obtain  suitable  local  bounds  on  the  stability 
of  the  filter  and  on  the  one-step  approximation  error.  Both  these  problems  will  be 
approached  by  application  of  the  Dobrushin  comparison  theorem. 

In  its  most  basic  form,  one  can  prove  a  filter  stability  property  of  the  following 
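type: provided $\varepsilon > \varepsilon_0$, there exists $\beta > 0$ (depending only on $\Delta$ and $r$) such that

$$\|F_n \cdots F_{s+1}\mu - F_n \cdots F_{s+1}\nu\|_J \le 4\, \operatorname{card} J\; e^{-\beta(n-s)}$$

for any probability measures $\mu$, $\nu$ on $\mathbb{X}$ and $J \subseteq V$, $n \ge 0$ (cf. Corollary A.5). This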
bound  is  evidently  dimension-free,  unlike  the  crude  filter  stability  bound  described 
in Section 4.5.1. Nonetheless, this filter stability bound would yield a trivial result
when substituted in the error decomposition, as it does not provide any control in
terms of the distance between $\mu$ and $\nu$ (and therefore in terms of the one-step error).
Instead, we will prove in Section A.1 the local stability bound


$$\|F_n \cdots F_{s+1}\mu - F_n \cdots F_{s+1}\nu\|_J \le 2\, e^{-\beta(n-s)} \sum_{v\in J} \max_{v'\in V} e^{-\beta d(v,v')}\, D_{v'}(\mu,\nu),$$

where $D_{v'}(\mu,\nu)$ is a suitable measure of the local error between $\mu$ and $\nu$ at site $v'$
that arises naturally from the Dobrushin comparison theorem (see Proposition A.2
for precise expressions). This filter stability bound is genuinely local: the stability
on the spatial set $J \subseteq V$ depends predominantly on the local distance of the initial
conditions near $J$ (that is, the spatial accumulation of errors is mitigated). This
localization comes at a price, however: the local filter stability bound holds only if
the initial condition $\mu$ satisfies a priori a decay of correlations property.

Once the local filter stability bound is substituted in the error decomposition, it
remains to prove a bound on the one-step error $D_v(F_s\tilde\pi_{s-1}^x, \tilde F_s\tilde\pi_{s-1}^x)$ with respect to the
local distance prescribed by the filter stability bound. This will be done in Section
A.2: we will show that for a constant $C$ that depends only on $\Delta$, $r$, $\varepsilon$,

$$D_v(F_s\mu, \tilde F_s\mu) \le C\, e^{-\beta d(v,\partial K)}$$

for every $K \in \mathcal{K}$ and $v \in K$, provided again that $\mu$ satisfies a priori a decay of
correlations property. This is precisely what we expect: as $B$ only introduces errors
at the block boundaries, the decay of correlations should ensure that the error at site
$v$ decays exponentially in the distance to the nearest block boundary. The Dobrushin
comparison theorem allows us to make this intuition precise.

The  decay  of  correlations  property  evidently  plays  a  dual  role  in  our  setting: 
it  controls  the  approximation  error  of  the  block  filter,  which  is  the  basic  principle 
behind  the  block  particle  filtering  algorithm;  at  the  same  time,  it  mitigates  the  spatial 
accumulation  of  approximation  errors,  which  is  essential  for  proving  dimension-free 
bounds.  In  order  to  apply  the  above  bounds,  the  key  step  that  remains  is  to  prove 
that  the  appropriate  decay  of  correlations  property  does  in  fact  hold,  uniformly  in 
time, for the block filter $\tilde\pi_n^x$. The latter will be shown in Section A.3 by iterating a
one-step decay of correlations bound that is obtained once again using the Dobrushin
comparison theorem. We conclude by putting together all these ingredients in Section
A.4 to obtain a bound on the bias of the form

$$\|\pi_n^x - \tilde\pi_n^x\|_J \le C\, \operatorname{card} J\; e^{-\beta d(J,\partial K)}$$

for $J \subseteq K$ (Theorem A.12). This proves the first half of Theorem 4.2 (note that,
as the bias does not depend on the random sampling in the block particle filtering
algorithm, we can trivially replace $\|\pi_n^x - \tilde\pi_n^x\|_J$ by $|||\pi_n^x - \tilde\pi_n^x|||_J$ in this bound).

Figure 4.4: For a linear spatial graph $G$ partitioned into blocks A-E (with $r = 1$),
the dependencies between the blocks at subsequent times are illustrated here. The
left dependency graph represents $B^C P$; the right graph represents $B^C P B P$. The
blocking operation unravels the original graph into a tree by introducing independent
duplicates (dotted boxes) of blocks in the previous time step.


4.5.4  Bounding  the  variance:  the  computation  tree 

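To bound the variance term $|||\tilde\pi_n^x - \hat\pi_n^x|||_J$, we once again start from the basic error
decomposition

$$|||\tilde\pi_n^x - \hat\pi_n^x|||_J \le \sum_{s=1}^n |||\tilde F_n \cdots \tilde F_{s+1}\tilde F_s\hat\pi_{s-1}^x - \tilde F_n \cdots \tilde F_{s+1}\hat F_s\hat\pi_{s-1}^x|||_J.$$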

The  difficulties  encountered  in  controlling  this  expression  are  quite  different  in  nature, 
however,  than  what  was  needed  to  control  the  bias  term. 

Dimension-free  bounds  on  the  bias  exploit  decay  of  correlations:  the  core  difficulty 
is  to  obtain  local  control  of  the  error  inside  the  blocks.  The  variance  term,  on  the  other 
hand,  will  already  grow  exponentially  in  the  size  of  the  blocks  due  to  the  exponential 
dependence  of  the  sampling  error  on  the  dimension  of  the  observations.  There  is 
therefore no need to bound the error on a finer scale than a single block. This makes the
analysis  of  the  variance  much  less  delicate  than  controlling  the  bias,  and  it  is  indeed 
not  difficult  to  obtain  a  variance  bound  of  the  right  order  on  a  finite  time  horizon 
(but  growing  exponentially  in  time  n). 

The  chief  difficulty  in  controlling  the  variance  is  to  obtain  a  time-uniform  bound. 
Note  that,  in  the  error  decomposition  for  the  variance  term,  it  is  not  stability  of  the 
filter $\pi_n^x$ that enters the picture but rather stability of the block filter $\tilde\pi_n^x$. Unlike
the filter, however, which has by construction an interpretation as the marginal of a
smoothing distribution, the block filter is defined by a recursive algorithm and not as
a conditional expectation. It is therefore not entirely obvious how one could adapt
the  approach  outlined  in  Section  4.5.2  to  this  setting. 

The  key  idea  that  will  be  used  to  establish  stability  is  that  the  block  filter  can 
nonetheless be viewed as the marginal of a suitably defined Markov random field,
just like the filter can be viewed as the marginal of a smoothing distribution. This
random field, however, lives on a much larger index set than the original model.
The  basic  idea  behind  the  construction  is  illustrated  in  Figure  4.4  (disregarding  the 
observations  for  simplicity  of  exposition).  When  we  apply  the  transition  operator  P, 
each block interacts with its $\Delta_{\mathcal{K}}$ neighbors in the previous time step. However, if
we  subsequently  apply  the  blocking  operator  B,  then  each  block  is  replaced  by  an 
independent  copy.  This  could  be  modelled  equivalently  by  introducing  independent 
duplicates  of  the  blocks  in  the  previous  time  step,  and  having  each  block  interact 
with  its  own  set  of  duplicates.  This  unravels  the  original  dependency  graph  into  a 
tree.  By  iterating  this  process,  we  can  express  the  block  filter  as  the  marginal  of  a 
Markov random field defined on a tree that contains many independent duplicates of
each  block.  We  call  this  construction  the  computation  tree  in  analogy  with  a  similar 
notion  that  arises  in  the  analysis  of  belief  propagation  algorithms  [50]. 

With  this  construction  in  place,  we  can  now  obtain  a  stability  bound  for  the  block 
filter  by  applying  the  Dobrushin  comparison  theorem  to  the  computation  tree.  This 
will be done in Section A.5 to obtain a bound of the following form: provided $\varepsilon > \varepsilon_0$,
there exist $\beta, \beta' > 0$ (depending only on $\Delta$, $\Delta_{\mathcal{K}}$, $r$) such that

$$\max_{K\in\mathcal{K}} \|\tilde F_n \cdots \tilde F_{s+1}\mu - \tilde F_n \cdots \tilde F_{s+1}\nu\|_K \le e^{\beta'|\mathcal{K}|_\infty}\, e^{-\beta(n-s)} \max_{K\in\mathcal{K}} \|\mu^K - \nu^K\|$$

for any pair of initial conditions of product form $\mu = \bigotimes_{K\in\mathcal{K}} \mu^K$, $\nu = \bigotimes_{K\in\mathcal{K}} \nu^K$ (cf.
Corollary A.16). Combining this bound with the error decomposition, we obtain in
Section A.6 a time-uniform bound on the variance term of the form

$$\max_{K\in\mathcal{K}} |||\tilde\pi_n^x - \hat\pi_n^x|||_K \le C\, \frac{e^{\beta'|\mathcal{K}|_\infty}}{\sqrt{N}},$$

where we bound the one-step error in the same spirit as the computation for the SIR
particle filter in Section 3.1 (however, a more involved argument is needed here to
surmount the fact that the block filter stability bound is given in a total variation
norm rather than the weaker norm $|||\cdot|||_K$). Thus Theorem 4.2 is proved.

Remark 4.10 (Alternative error decomposition). The reason we must consider stability
of the block filter is that we have first split the error into the bias $|||\pi_n^x - \tilde\pi_n^x|||_J$
and variance $|||\tilde\pi_n^x - \hat\pi_n^x|||_J$ parts, and then applied the error decomposition to each term
separately. One might hope to circumvent the problem by applying the error decomposition
directly to the total error $|||\pi_n^x - \hat\pi_n^x|||_J$ as was illustrated in Section 4.5.1, and
then splitting the one-step error terms in this bound into bias and variance parts:

$$\begin{aligned}
|||F_n \cdots F_{s+1}F_s\hat\pi_{s-1}^x - F_n \cdots F_{s+1}\hat F_s\hat\pi_{s-1}^x|||_J
&\le |||F_n \cdots F_{s+1}F_s\hat\pi_{s-1}^x - F_n \cdots F_{s+1}\tilde F_s\hat\pi_{s-1}^x|||_J \\
&\quad + |||F_n \cdots F_{s+1}\tilde F_s\hat\pi_{s-1}^x - F_n \cdots F_{s+1}\hat F_s\hat\pi_{s-1}^x|||_J.
\end{aligned}$$

In  this  case,  only  stability  of  the  filter  is  needed  to  control  the  error  accumulation. 

Unfortunately, using this simpler approach it is impossible to obtain a nontrivial
bound on the bias. Indeed, to control the one-step bias $D_v(F_s\mu, \tilde F_s\mu)$, it is essential that
$\mu$ satisfies a decay of correlations property. In Section 4.5.3, the error decomposition
required us to obtain such a bound for $\mu = \tilde\pi_{s-1}^x$, and we showed that the block filter
does indeed possess the requisite decay of correlations property. On the other hand, if
we apply the error decomposition to the total error as above, one would have to obtain
such a bound for $\mu = \hat\pi_{s-1}^x$. This is impossible, as $\hat\pi_{s-1}^x$ cannot possess a useful decay
of correlations property within the blocks.

To see this, consider what happens when we apply the Dobrushin comparison theorem
to an empirical measure $\rho = \frac{1}{N}\sum_{k=1}^N \delta_{X_k}$ with $X_k$ i.i.d. $\sim \nu$. Suppose that
$\nu = \bigotimes_{i\in I} \nu^i$ for some (nonatomic) measures $\nu^i$: this is the extreme case where $\nu$ has
no spatial correlations at all. Nonetheless, the empirical measure $\rho$ will be maximally
correlated: as each $X_k^i$ is distinct with unit probability, we obtain $\rho^i_X = \delta_{X^i}$ for every
$X \in \{X_1,\ldots,X_N\}$, so that $C_{ij} = 1$ for every $i \ne j$ in Theorem 2.11. We therefore
see that sampling destroys decay of correlations (this is, in essence, the same phenomenon
that causes the curse of dimensionality of particle filters). For this reason,
it is essential to consider the bias and variance terms separately.
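
The mechanism described in the remark can be checked on simulated data. The sketch below draws particles with independent uniform coordinates (an illustrative stand-in for a nonatomic product measure) and verifies that a single coordinate already identifies the whole particle. The particle count, dimension, seed, and function name are assumptions made only for this example.

```python
import random

def coordinate_identifies_particle(particles: list[list[float]], i: int) -> bool:
    """True if the i-th coordinate takes a different value on every particle,
    so that conditioning on it pins down all remaining coordinates."""
    values = [x[i] for x in particles]
    return len(set(values)) == len(values)

rng = random.Random(0)
N, d = 50, 5
# Coordinates are independent under the sampling distribution.
particles = [[rng.random() for _ in range(d)] for _ in range(N)]

# Under the empirical measure, each coordinate determines the particle.
every_coordinate_identifies = all(
    coordinate_identifies_particle(particles, i) for i in range(d)
)
print(every_coordinate_identifies)
```

When every coordinate identifies the particle, the conditional distribution of one coordinate given any other is a point mass, which is the sense in which the empirical measure is maximally correlated even though the sampling distribution has no correlations.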




Chapter  5 


Localized  Gibbs  sampler  particle 
filter 


This  chapter  is  devoted  to  introducing  a  particle  filter  algorithm  that  implements 
a  spatially  homogeneous  localization  to  overcome  the  curse  of  dimensionality,  hence 
addressing  the  main  drawback  of  the  block  particle  filter  analyzed  in  the  previous 
chapter.  While  a  complete  analysis  of  this  algorithm  is  still  missing,  we  prove  a 
one-step  error  bound  for  the  bias  term  that  illustrates  the  mechanism  that  can 
provide spatially homogeneous approximations of the filter distribution. The goal of
this  chapter  is  also  to  show  that  the  general  idea  of  local  particle  filters  is  much 
broader  than  is  suggested  by  the  block  particle  filtering  algorithm,  and  that  the 
mathematical  analysis  developed  in  this  thesis  could  in  itself  provide  inspiration  for 
further  methodological  developments.  The  material  presented  in  this  chapter  is  new 
and has not yet been submitted for publication.

Henceforth, we work in the same setting introduced in Section 4.1.


5.1  Motivations 

The  block  particle  filter  was  introduced  in  Section  4.3  by  localizing  the  SIR  particle 
filter recursion $\hat\pi_n = C_n S^N P \hat\pi_{n-1}$ to $\hat\pi_n = C_n B S^N P \hat\pi_{n-1}$, via the blocking operator
$B$ that projects probability measures to the product of their marginals over a fixed
partition $\mathcal{K}$ of the vertex set $V$.

At the heart of our main result (Theorem 4.2) lies the decay of correlations. In the
proofs there we used an intuitive notion of decay of correlations of essentially the following
form: a probability measure $\rho$ on $\mathbb{X} = \prod_{v\in V} \mathbb{X}^v$ possesses the decay of correlations
property if the effect on the conditional distribution $\rho(X^v \in \cdot \mid X^{V\setminus\{v\}} = x^{V\setminus\{v\}})$
of a perturbation to $x^{v'}$ decays exponentially in the distance $d(v,v')$ (cf. Sections 4.5.2
and A.1). The blocking operation evidently replaces these conditional distributions
by

$$(B\rho)(X^v \in A \mid X^{V\setminus\{v\}} = x^{V\setminus\{v\}}) = \rho(X^v \in A \mid X^{K\setminus\{v\}} = x^{K\setminus\{v\}})$$

for every $K \in \mathcal{K}$ and $v \in K$. Therefore, if $\rho$ possesses the decay of correlations
property, then the bias at site $v \in K$ incurred by the blocking operation decays
exponentially in the distance between $v$ and the boundary of $K$. On the other hand,
the  sampling  error  depends  only  on  the  dimension  of  the  block,  and  not  on  the 
dimension  of  the  entire  system. 
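
The action of the blocking operator can be illustrated on a small discrete state space. The sketch below represents a measure as a dictionary from states to probabilities and forms the product of its block marginals. The state space $\{0,1\}^4$, the two blocks, and the function names `marginal` and `block_projection` are hypothetical choices for this example only.

```python
from itertools import product

def marginal(rho: dict, block: tuple) -> dict:
    """Marginal of rho on the coordinates listed in block."""
    out = {}
    for x, p in rho.items():
        key = tuple(x[v] for v in block)
        out[key] = out.get(key, 0.0) + p
    return out

def block_projection(rho: dict, blocks: list) -> dict:
    """Product of the block marginals of rho (the blocks partition the sites)."""
    margs = [marginal(rho, b) for b in blocks]
    d = sum(len(b) for b in blocks)
    out = {}
    for combo in product(*(m.items() for m in margs)):
        x, p = [None] * d, 1.0
        for b, (vals, q) in zip(blocks, combo):
            for v, val in zip(b, vals):
                x[v] = val
            p *= q
        out[tuple(x)] = p
    return out

# A measure in which all four sites are perfectly correlated.
rho = {(0, 0, 0, 0): 0.5, (1, 1, 1, 1): 0.5}
blocked = block_projection(rho, [(0, 1), (2, 3)])
print(blocked)
```

In this example the correlation inside each block survives while the correlation across the two blocks is discarded, which is the source of the bias near block boundaries.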

As we discussed in Section 4.4 (particularly in Section 4.4.3), the major drawback
of the block particle filtering algorithm is precisely the spatial inhomogeneity
of  the  bias,  as  the  blocking  introduces  errors  at  the  block  boundaries:  points  near 
the  boundaries  will  always  be  subject  to  larger  errors.  On  the  one  hand,  it  is  true 
that  by  optimizing  the  error  bound  in  Theorem  4.2  we  find  that  the  size  of  the  blocks 
increases  as  the  number  of  particles  N  increases,  so  that  more  points  are  distant 
from  the  block  boundaries  and  therefore  benefit  from  the  decay  of  correlations.  On 
the  other  hand,  our  theory  suggests  that  the  size  of  the  blocks  typically  increases 
slowly  (logarithmically)  with  the  number  of  particles  (see  Corollary  4.6  for  a  concrete 
example),  so  that  we  should  not  consider  large  blocks. 

From  this  perspective,  an  approach  to  spatially  homogeneous  algorithms  readily 
suggests  itself:  we  should  aim  to  replace  B  with  another  operator  M  that  satisfies 

$$(M\rho)(X^v \in A \mid X^{V\setminus\{v\}} = x^{V\setminus\{v\}}) = \rho(X^v \in A \mid X^{N_b(v)\setminus\{v\}} = x^{N_b(v)\setminus\{v\}})$$

for every $v \in V$, where $N_b(v) := \{v' \in V : d(v,v') \le b\}$. The bias incurred by
this operation decays exponentially in $b$ uniformly for all $v$ (it is therefore spatially
homogeneous). On the other hand, as

$$(C_n M P\rho)(X^v \in A \mid X^{V\setminus\{v\}} = x^{V\setminus\{v\}}) = \frac{\int \mathbf{1}_A(x^v)\, g^v(x^v,Y_n^v) \prod_{w\in N_b(v)} p^w(z,x^w)\, \rho(dz)\, \psi^v(dx^v)}{\int g^v(x^v,Y_n^v) \prod_{w\in N_b(v)} p^w(z,x^w)\, \rho(dz)\, \psi^v(dx^v)},$$

the sampling error incurred if we replace $\rho$ by $S^N\rho$ in this expression should only be
exponential in $\operatorname{card} N_b(v)$ (which is $\sim b^q$ for the square lattice) rather than in the
model dimension $\operatorname{card} V$. This suggests that the local particle filter defined by the recursion
$\hat F_n = S^N C_n M P$ should yield a spatially homogeneous algorithm in accordance
with  our  intuition. 
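
To make the growth rate of $\operatorname{card} N_b(v)$ concrete, the sketch below counts lattice points within distance $b$ of a site of $\mathbb{Z}^q$. The use of the $\ell^1$ (graph) distance and the function name are assumptions for this example; the text only states that the cardinality is of order $b^q$ for the square lattice.

```python
from itertools import product

def neighborhood_size(b: int, q: int) -> int:
    """Number of points of Z^q within l1-distance b of a given site."""
    return sum(
        1
        for x in product(range(-b, b + 1), repeat=q)
        if sum(abs(c) for c in x) <= b
    )

# The neighborhood grows polynomially in b and does not depend on card V.
for b in (1, 2, 4, 8):
    print(b, neighborhood_size(b, 2))
```

For $q = 2$ this count equals $2b^2 + 2b + 1$, so the sampling cost of the localized conditional is governed by $b$ alone.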

To implement this algorithm one needs to sample from the measure $C_n M P\rho$, which
we  have  defined  only  implicitly  in  terms  of  its  conditional  distributions.  However,  this 
is  precisely  the  task  to  which  Markov  chain  Monte  Carlo  (MCMC)  methods  are  well 
suited.  These  methods  sample  from  a  probability  measure  by  constructing  a  Markov 
chain  that  has  the  desired  measure  as  its  equilibrium  distribution.  In  particular,  the 
Gibbs  sampler  (Section  5.2  below)  is  a  MCMC  method  that  implements  this  paradigm 
by  using  transition  kernels  that  are  defined  in  terms  of  the  conditional  distributions 
of  the  desired  measure. 

One would therefore ostensibly obtain a spatially homogeneous local particle filtering
algorithm that is recursive in time and that uses MCMC to sample the spatial
degrees of freedom (regularization using $M$ is still key to avoiding the curse of dimensionality,
as replacing the sampling step in ordinary particle filters by an MCMC
method does not resolve the fundamental problem that we face in high dimension;
see [3] for related discussion).

Conceptually, the idea introduced here is quite natural. The general idea of local
particle filters is that one should introduce a spatial regularization step into the
filtering recursion that enables local sampling. In the block particle filter, this regularization
is provided by the blocking operation $B$ that projects a probability measure
on the class of measures that are independent across blocks. In the above algorithm,
we aim to regularize instead by the operation $M$ that projects a probability measure
on the class of Markov random fields of order $b$. The fatal flaw in our reasoning is
that the operator $M$ that we have defined implicitly above does not exist: the truncated
conditional distributions $\rho(X^v \in \cdot \mid X^{N_b(v)\setminus\{v\}} = x^{N_b(v)\setminus\{v\}})$ are typically not
consistent, so there exists no single probability measure that satisfies our definition
of $M\rho$.

Nonetheless, the basic idea just discussed suggests a practical approach to approximating
random fields by Markov random fields: we can substitute the above
expression for $(C_n M P\rho)(X^v \in \cdot \mid X^{V\setminus\{v\}})$ in a Gibbs sampler regardless of its inconsistency.
The algorithm that we will introduce in this chapter, the localized Gibbs
sampler  particle  filter,  exactly  implements  this  idea  to  yield  spatially  homogeneous 
estimates  of  the  filter  distribution. 

While the analysis of the block particle filter relies heavily on the Dobrushin comparison
theorem (Theorem 2.11), the analysis of the localized Gibbs sampler particle
filter  relies  crucially  on  the  one-sided  Dobrushin  comparison  theorem  (Theorem  2.12), 
which  is  needed  to  capture  the  directionality  of  time  embedded  in  the  definition  of 
Gibbs  samplers.  Following  the  same  bias/variance  decomposition  scheme  adopted  in 
Chapter  4,  we  will  prove  a  spatially  homogeneous  one-step  error  bound  for  the  bias 
of  the  localized  Gibbs  sampler  particle  filter  (Theorem  5.4). 

While  this  result  is  extremely  encouraging,  the  analysis  of  the  localized  Gibbs 
sampler  particle  filter  has  proved  to  be  much  more  challenging  than  the  analysis  of 
the  block  particle  filter,  and  a  complete  picture  is  still  lacking.  With  respect  to  the 
proof  strategy  followed  in  Chapter  4,  the  crucial  difficulty  lies  in  establishing  a  decay 
of  correlations  property  for  the  approximate  filter  that  is  uniform  in  time.  While 
we  have  strong  reasons  to  believe  that  this  property  should  hold,  it  seems  that  the 
Dobrushin  comparison  theorems  are  not  adequate  to  capture  it.  A  more  delicate 
analysis  is  needed,  with  new  tools  to  be  developed. 


5.2  Gibbs  sampler 

The  backbone  of  the  localized  Gibbs  sampler  particle  filter  is  the  Gibbs  sampler, 
a  MCMC  algorithm  that  samples  from  a  high-dimensional  distribution  p  on  X  by 
sampling iteratively from the low-dimensional distributions $\rho(X^v \in \cdot \mid X^{V\setminus\{v\}})$, $v \in V$.
Henceforth in this chapter, label the elements of $V$ as $\{v_1,\ldots,v_d\}$, where $d = \operatorname{card} V$,
and introduce the notation $v_k : v_{k'} := \{v_k, v_{k+1}, \ldots, v_{k'}\}$. The systematic-scan Gibbs
sampler is the algorithm described in Figure 5.1.

Algorithm 4: Systematic-scan Gibbs sampler
Data: Fix $m \ge 1$, $\chi$ probability measure on $\mathbb{X}$.
Let $X_0 \sim \chi$;
for $\ell = 1,\ldots,m$ do
    for $k = 1,\ldots,d$ do
        Sample $X_\ell^{v_k} \sim \rho(X^{v_k} \in \cdot \mid X^{v_1:v_{k-1}} = X_\ell^{v_1:v_{k-1}},\ X^{v_{k+1}:v_d} = X_{\ell-1}^{v_{k+1}:v_d})$;
Output: $X_m$.


Figure  5.1:  Systematic-scan  Gibbs  sampler. 


As described in Figure 5.1, in the $\ell$-th round of the algorithm (that is needed to
sample $X_\ell$) each coordinate $X_\ell^{v_k}$ is cyclically obtained by sampling from the conditional
distribution given all other coordinates. The cyclic sampling occurs systematically,
following the ordering given by $v_1,\ldots,v_d$. Each round of the algorithm is
usually referred to as a sweep of the algorithm.
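
A minimal sketch of the systematic-scan scheme follows. The target used here, a standard bivariate normal with correlation 0.8, is an illustrative assumption chosen because its one-dimensional conditionals are known in closed form; the function names and the callback signature `sample_conditional(k, x, rng)` are likewise hypothetical and not part of the text.

```python
import random

def gibbs_sampler(sample_conditional, x0, m, rng):
    """Run m systematic-scan sweeps starting from x0.

    sample_conditional(k, x, rng) must draw coordinate k from its conditional
    distribution given the other coordinates currently stored in x.
    """
    x = list(x0)
    for _ in range(m):               # sweeps l = 1, ..., m
        for k in range(len(x)):      # fixed ordering v_1, ..., v_d
            # Coordinates before k are already updated in this sweep;
            # coordinates after k still hold the values of the previous sweep.
            x[k] = sample_conditional(k, x, rng)
    return x

def bivariate_normal_conditional(corr):
    """Conditionals of a standard bivariate normal with correlation corr."""
    sd = (1.0 - corr ** 2) ** 0.5
    def sample(k, x, rng):
        return rng.gauss(corr * x[1 - k], sd)
    return sample

rng = random.Random(1)
cond = bivariate_normal_conditional(0.8)
# Independent runs of m = 20 sweeps each, all started from the same point.
draws = [gibbs_sampler(cond, [0.0, 0.0], m=20, rng=rng) for _ in range(2000)]
mean_prod = sum(a * b for a, b in draws) / len(draws)
print(mean_prod)
```

Each call returns the state after the final sweep, so the empirical correlation across independent runs should be close to the target value when the number of sweeps is large enough for the chain to mix.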

The Gibbs sampler is a MCMC method that constructs a Markov chain $(X_n)_{n\ge 0}$
that admits $\rho$ as its invariant measure (by construction $\rho$ satisfies $\rho = \rho P$, where
$P$ is the transition kernel of the Markov chain). The main rationale is that if the
Markov chain is quickly converging to equilibrium (rapidly mixing), then for large $m$
we can reliably interpret $X_m$, the output of the algorithm in Figure 5.1, as a random
variable  whose  distribution  is  close  to  p.  We  refer  to  [8]  for  an  extensive  treatment  of 
MCMC  methods  in  the  context  of  filtering  theory. 

To  facilitate  the  description  of  what  follows,  we  introduce  the  (systematic-scan) 
Gibbs  sampler  sampling  operator. 

Given a probability measure $\rho$ on $\mathbb{X}$ and $v \in V$, let $G^v\rho$ be the transition kernel
defined as follows:

$$G^v\rho(x,A) := \int \rho(X^v \in d\omega^v \mid X^{V\setminus\{v\}} = x^{V\setminus\{v\}})\, \delta_{x^{V\setminus\{v\}}}(d\omega^{V\setminus\{v\}})\, \mathbf{1}_A(\omega).$$

Definition 5.1 (Gibbs sampler sampling operator). Let $\chi$ be a probability measure
on $\mathbb{X}$. Define the Gibbs sampler sampling operator $S^{N,m}_\chi$ as

$$S^{N,m}_\chi \rho := S^N\big(\chi\,(G^{v_1}\rho \cdots G^{v_d}\rho)^m\big) = \frac{1}{N}\sum_{i=1}^N \delta_{X(i)},$$

where $X(1),\ldots,X(N)$ are i.i.d. samples, each obtained by running the algorithm
described in Figure 5.1 with respect to the family of conditional distributions
$(\rho(X^v \in \cdot \mid X^{V\setminus\{v\}}))_{v\in V}$ for $m$ sweeps and with initial distribution $\chi$, and $S^N$ is the sampling
operator defined in Definition 2.16 (see Section 3.3 for a discussion on how to sample
from Markov chains).

$^{1}$ Other sampling schemes can be considered, such as uniformly sampling $d$ times the elements of
$V$ at each round of the algorithm. This gives rise to the so-called random-scan Gibbs sampler.

Remark 5.2. In the literature on Gibbs samplers the typical empirical measure that
is considered has the following form:

$$\frac{1}{N}\sum_{k=0}^{N-1} \delta_{X_{n_0+k\ell}},$$

where $(X_n)_{n\ge 0}$ is the Markov chain generated by the algorithm in Figure 5.1, $n_0$ is the
so-called burn-in period that represents the amount of time it takes for the Markov
chain to reach its invariant distribution (which is the measure we want to sample
from), and $\ell$ is the period at which samples are taken into consideration (so as to have
samples that are close to being independent). On the other hand, in Definition 5.1 we
consider samples that are independent by construction so as to simplify the theoretical
analysis  of  the  algorithm. 
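
For comparison with Definition 5.1, the single-chain estimator of Remark 5.2 can be sketched as follows. The function names and the example values of the burn-in $n_0$, the period $\ell$ (written `ell`), and $N$ are illustrative assumptions.

```python
def thinned_indices(n0: int, ell: int, N: int) -> list[int]:
    """Time indices n0, n0 + ell, ..., n0 + (N - 1) * ell retained from one chain."""
    return [n0 + k * ell for k in range(N)]

def thinned_average(chain, f, n0: int, ell: int, N: int) -> float:
    """Integrate f against the empirical measure of the retained states."""
    return sum(f(chain[n]) for n in thinned_indices(n0, ell, N)) / N

# A burn-in of 100 steps followed by every 10th state, keeping N = 4 states.
print(thinned_indices(100, 10, 4))
```

A single chain of length $n_0 + (N-1)\ell + 1$ suffices here, whereas the operator of Definition 5.1 runs $N$ independent chains of $m$ sweeps each; the latter is more expensive but yields exactly independent samples.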


5.3  Gibbs  sampler  particle  filter 


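As the block particle filter was introduced by localizing the SIR particle filter recursion,
the local algorithm that we analyze in this chapter likewise comes as a localization of
another particle filter, the Gibbs sampler particle filter, that we presently introduce.
Fix $N, m \ge 1$. We define the Gibbs sampler particle filter recursion as follows:

$$\hat\pi_0^\mu := \mu, \qquad \hat\pi_n^\mu := S^{N,m}_{\hat\pi_{n-1}^\mu} C_n P\hat\pi_{n-1}^\mu \quad (n \ge 1),$$

where the recursion consists of three steps:

$$\hat\pi_{n-1}^\mu \;\xrightarrow{\ \text{prediction}\ }\; P\hat\pi_{n-1}^\mu \;\xrightarrow{\ \text{correction}\ }\; C_n P\hat\pi_{n-1}^\mu \;\xrightarrow{\ \text{MCMC sampling}\ }\; \hat\pi_n^\mu := S^{N,m}_{\hat\pi_{n-1}^\mu} C_n P\hat\pi_{n-1}^\mu.$$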


In Lemma 5.3 below we prove that under the usual mixing conditions considered
in Chapter 4 ($\varepsilon \le p^v \le \varepsilon^{-1}$ for $0 < \varepsilon < 1$), as the number of sweeps $m$ goes to
infinity the Gibbs sampler particle filter recursion $\hat\pi_n^\mu = S^{N,m}_{\hat\pi_{n-1}^\mu} C_n P\hat\pi_{n-1}^\mu$ converges to
the "optimal" SIR particle filter recursion $\hat\pi_n^\mu = S^N C_n P\hat\pi_{n-1}^\mu$ (cf. Remark 3.13). For
this reason we have not introduced a separate notation for the Gibbs sampler particle
filter and the SIR particle filter introduced in Chapter 3.

Presently, we illustrate one possible implementation of this algorithm. Let

$$\hat\pi_{n-1}^\mu = \frac{1}{N}\sum_{i=1}^N \delta_{X_{n-1}(i)},$$

where $X_{n-1}(1),\ldots,X_{n-1}(N)$ are the samples coming from the $(n-1)$-th iteration
of the Gibbs sampler sampling operator, that is, the samples coming from
$S^{N,m}_{\hat\pi_{n-2}^\mu} C_{n-1} P\hat\pi_{n-2}^\mu$. Recall that the Gibbs sampler samples iteratively from the conditional
distributions of the measure it is applied to. While there are many ways
to implement this sampling scheme (for instance, by using rejection sampling to
directly sample from the conditional distributions, which need only be
known pointwise), we present here a sampling procedure that takes place in
multiple stages so as to resemble the sampling scheme adopted for ordinary particle
filters (with importance weights, see Chapter 3). This formulation will serve
to highlight, at least at a heuristic level, the reason why the Gibbs sampler
particle filter algorithm also suffers from the curse of dimensionality, and it will readily
suggest a way around the problem (see Section 5.4 below).

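Notice that for any measure $\rho$ on $\mathbb{X}$ we have

$$(C_n P\rho)(X^v \in A \mid X^{V\setminus\{v\}} = x^{V\setminus\{v\}}) = \frac{\int \rho(dz) \prod_{w\in V\setminus\{v\}} p^w(z,x^w) \int p^v(z,\omega)\, \psi^v(d\omega)\, g^v(\omega,Y_n^v)\, \mathbf{1}_A(\omega)}{\int \rho(dz) \prod_{w\in V\setminus\{v\}} p^w(z,x^w) \int p^v(z,\omega)\, \psi^v(d\omega)\, g^v(\omega,Y_n^v)}.$$

Hence, we can write

$$\begin{aligned}
(C_n P\hat\pi_{n-1}^\mu)(X^v \in A \mid X^{V\setminus\{v\}} = x^{V\setminus\{v\}})
&= \frac{\sum_{i=1}^N \prod_{w\in V\setminus\{v\}} p^w(X_{n-1}(i),x^w) \int p^v(X_{n-1}(i),\omega)\, \psi^v(d\omega)\, g^v(\omega,Y_n^v)\, \mathbf{1}_A(\omega)}{\sum_{i=1}^N \prod_{w\in V\setminus\{v\}} p^w(X_{n-1}(i),x^w) \int p^v(X_{n-1}(i),\omega)\, \psi^v(d\omega)\, g^v(\omega,Y_n^v)} \\
&= \sum_{i=1}^N W^v_{n,x}(i)\, q_n^v(X_{n-1}(i),A),
\end{aligned}$$

where the weights are defined as

$$W^v_{n,x}(i) := \frac{Z_n^v(X_{n-1}(i)) \prod_{w\in V\setminus\{v\}} p^w(X_{n-1}(i),x^w)}{\sum_{j=1}^N Z_n^v(X_{n-1}(j)) \prod_{w\in V\setminus\{v\}} p^w(X_{n-1}(j),x^w)},$$

and $q_n^v$ is a transition kernel from $\mathbb{X}$ to $\mathbb{X}^v$ defined as

$$q_n^v(z,A) := \frac{\int p^v(z,\omega)\, g^v(\omega,Y_n^v)\, \psi^v(d\omega)\, \mathbf{1}_A(\omega)}{Z_n^v(z)},$$

with

$$Z_n^v(z) := \int p^v(z,\omega)\, g^v(\omega,Y_n^v)\, \psi^v(d\omega).$$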

As the weights are positive and $\sum_{i=1}^N W^v_{n,x}(i) = 1$ by construction, they can be
interpreted as probabilities. So, sampling from $(C_n P\hat\pi_{n-1}^\mu)(X^v \in \cdot \mid X^{V\setminus\{v\}} = x^{V\setminus\{v\}})$
can be achieved by first sampling $J$ from the distribution $j \in \{1,\ldots,N\} \mapsto W^v_{n,x}(j)$,
and then sampling from $q_n^v(X_{n-1}(J), \cdot)$ (note that this is a one-dimensional integral,
and one can use one of the methods in [ ] to sample from it, such as rejection
sampling). The resulting algorithm is given in Figure 5.2.$^{2}$
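
The two-stage procedure just described, selecting an index according to the weights and then drawing from the kernel started at the selected particle, can be sketched as follows. The Gaussian kernel, the particle values, the weights, and the function names are hypothetical stand-ins for $q_n^v$, $X_{n-1}(i)$ and $W^v_{n,x}(i)$, chosen only so that the example is self-contained.

```python
import random

def sample_from_mixture(weights, particles, sample_kernel, rng):
    """Draw from sum_i weights[i] * q(particles[i], .) in two stages."""
    # Stage 1: pick a particle index J with probability weights[J].
    (j,) = rng.choices(range(len(particles)), weights=weights, k=1)
    # Stage 2: draw from the kernel started at the selected particle.
    return sample_kernel(particles[j], rng)

def gaussian_kernel(z, rng):
    """Hypothetical one-dimensional kernel q(z, .) = N(z, 1)."""
    return rng.gauss(z, 1.0)

rng = random.Random(0)
particles = [-5.0, 0.0, 5.0]
weights = [0.2, 0.3, 0.5]   # normalized, as the weights in the text are
draws = [sample_from_mixture(weights, particles, gaussian_kernel, rng)
         for _ in range(5000)]
mixture_mean = sum(draws) / len(draws)
print(mixture_mean)
```

Only the selected particle enters the second stage, so the cost of one conditional draw is one weighted index selection plus one low-dimensional draw.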

$^{2}$ Note that the algorithm illustrated in Figure 5.2 differs a little from the one described in the
main text, as for simplicity it is now assumed that $\hat\pi_0^\mu = \frac{1}{N}\sum_{i=1}^N \delta_{X_0(i)}$, for $X_0(1),\ldots,X_0(N) \sim \mu$
(if $\hat\pi_0^\mu = \mu$ as in the main text, then the weights $W^v_{n,x}(j)$ would be different). Moreover, note that
more clever implementations of this algorithm can be considered, but this is beyond the scope of
our current treatment.

Algorithm 5: Gibbs sampler particle filter
Data: Fix $n, m, N \ge 1$. Let the observations $Y_1,\ldots,Y_n$ be given.
Sample i.i.d. $X_0(i) \sim \mu$, $i = 1,\ldots,N$, and let $\hat\pi_0^\mu = \frac{1}{N}\sum_{i=1}^N \delta_{X_0(i)}$;
for $s = 1,\ldots,n$ do
    Sample i.i.d. $R_0(i)$, $i = 1,\ldots,N$, from the distribution $\hat\pi_{s-1}^\mu$;
    for $i = 1,\ldots,N$ do
        for $\ell = 1,\ldots,m$ do
            for $k = 1,\ldots,d$ do
                Let $R = (R_\ell^{v_1:v_{k-1}}(i),\, r,\, R_{\ell-1}^{v_{k+1}:v_d}(i))$, for any $r \in \mathbb{X}^{v_k}$;
                Sample $J$ from the distribution $j \in \{1,\ldots,N\} \mapsto W^{v_k}_{s,R}(j)$, with
                    $W^{v_k}_{s,R}(j) = \dfrac{\prod_{w\in V\setminus\{v_k\}} p^w(X_{s-1}(j),R^w) \int p^{v_k}(X_{s-1}(j),\omega)\, \psi^{v_k}(d\omega)\, g^{v_k}(\omega,Y_s^{v_k})}{\sum_{j'=1}^N \prod_{w\in V\setminus\{v_k\}} p^w(X_{s-1}(j'),R^w) \int p^{v_k}(X_{s-1}(j'),\omega)\, \psi^{v_k}(d\omega)\, g^{v_k}(\omega,Y_s^{v_k})}$;
                Sample $R_\ell^{v_k}(i) \sim q_s^{v_k}(X_{s-1}(J), d\omega) = \dfrac{p^{v_k}(X_{s-1}(J),\omega)\, g^{v_k}(\omega,Y_s^{v_k})\, \psi^{v_k}(d\omega)}{\int p^{v_k}(X_{s-1}(J),\omega)\, g^{v_k}(\omega,Y_s^{v_k})\, \psi^{v_k}(d\omega)}$;
    Let $X_s^v(i) = R_m^v(i)$, $i = 1,\ldots,N$, $v = v_1,\ldots,v_d$, and $\hat\pi_s^\mu := \frac{1}{N}\sum_{i=1}^N \delta_{X_s(i)}$;
Compute the approximate filter $\pi_n^\mu \approx \hat\pi_n^\mu$.

Figure 5.2: Gibbs sampler particle filter.


5.4  Sample  degeneracy  with  dimension 

The Gibbs sampler particle filter also runs into the curse of dimensionality. Ultimately, weight degeneracy occurs for the same reason that it occurs for the SIS algorithm (Section 3.2) and for the SIR algorithm (Section 3.3). To make this point, let us recall the definition of the weights (up to normalization factors) involved in these algorithms:

SIS particle filter → $W_n(i) \propto \prod_{k=1}^n \prod_{v \in V} g^v(X_k^v(i), Y_k^v)$,

SIR particle filter → $W_n(i) \propto \prod_{v \in V} g^v(X_n^v(i), Y_n^v)$,

Gibbs sampler particle filter → $W^v_{n,x}(i) \propto Z^v_n(X_{n-1}(i)) \prod_{w \in V\setminus\{v\}} p^w(X_{n-1}(i), x^w)$

(clearly, different algorithms involve different particles). Heuristically it is easy to see where the problem of weight degeneracy comes from: roughly speaking, the weights are driven towards zero or infinity exponentially fast with the dimension card V. In the SIS particle filter the problem appears both with time and space (see Section 3.2.1). In the SIR particle filter the problem is caused by the product of observation likelihoods (see Section 3.3.3), while in the Gibbs sampler particle filter the problem is caused by the product of transition likelihoods.
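The exponential degeneracy of product-form weights can be observed numerically. The sketch below is illustrative only and rests on assumptions not made in the text: the particle coordinates are i.i.d. standard Gaussian, the observation is zero at every site, and $g^v(x, y) \propto \exp(-(x-y)^2/2)$. The effective sample size (a standard diagnostic, not a quantity used in the text) is computed from SIR-type weights for several dimensions; all function names are hypothetical.

```python
import math
import random


def effective_sample_size(log_weights):
    """ESS = (sum w)^2 / sum w^2, computed stably from unnormalized log-weights."""
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    return sum(w) ** 2 / sum(wi * wi for wi in w)


def sir_log_weights(num_particles, dim, rng):
    """Log of prod_v g^v(X^v(i), Y^v) for the toy model with Y^v = 0 at every site."""
    return [
        -0.5 * sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim))
        for _ in range(num_particles)
    ]


def ess_by_dimension(dims, num_particles=200, seed=0):
    """Effective sample size of the product weights for each dimension in dims."""
    rng = random.Random(seed)
    return {
        d: effective_sample_size(sir_log_weights(num_particles, d, rng))
        for d in dims
    }
```

In this toy model the expected ratio of effective sample size to number of particles decays roughly geometrically in the dimension, so calling `ess_by_dimension([1, 10, 100])` gives a quick picture of how a fixed particle budget collapses onto a handful of particles.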

Proceeding along the same line of thought, it is easy to see why the block particle filter (see Section 4.3 for its definition) can overcome the curse of dimensionality. Note, in fact, that the weights involved in this algorithm read

Block particle filter → $W^K_n(i) \propto \prod_{v \in K} g^v(X_n^v(i), Y_n^v)$,

where the product of observation likelihoods is restricted only to the coordinates in the block $K \subseteq V$. Hence, the block particle filter samples at each coordinate $v$ by using weights that are defined only through coordinates contained in the element $K$ of the partition $\mathcal K$ such that $v \in K$. So, even if the dimensionality of the whole model (card V) is increased, what matters for the sake of weight degeneracy is only the dimensionality of the blocks (card K).

Following this intuition, a spatially homogeneous procedure to localize the Gibbs sampler particle filter readily suggests itself. If in this algorithm we replace the measure $(C_n P \rho)(X^v \in A \,|\, X^{V\setminus\{v\}} = x^{V\setminus\{v\}})$ with the following measure

$$\frac{\int \rho(dz) \prod_{w \in N_b(v)\setminus\{v\}} p^w(z, x^w)\, p^v(z,\omega)\, \psi^v(d\omega)\, g^v(\omega, Y_n^v)\, 1_A(\omega)}{\int \rho(dz) \prod_{w \in N_b(v)\setminus\{v\}} p^w(z, x^w)\, p^v(z,\omega)\, \psi^v(d\omega)\, g^v(\omega, Y_n^v)},$$

where the product over $w \in V$ is replaced with the product over $w \in N_b(v) := \{v' \in V : d(v,v') \le b\}$, then the new algorithm, which we call the localized Gibbs sampler particle filter, would yield weights of the following form:

Localized Gibbs sampler particle filter → $W^v_{n,x}(i) \propto Z^v_n(X_{n-1}(i)) \prod_{w \in N_b(v)\setminus\{v\}} p^w(X_{n-1}(i), x^w)$.

That is, the new algorithm samples at each coordinate $v$ by using weights that are defined only through coordinates contained in a ball of radius $b$ centered at $v$. Thus, we obtain a spatially homogeneous way of localizing the sampling step, using the Gibbs sampler as a way of constructing a high-dimensional distribution from its conditional distributions (as discussed in Section 5.1, recall that this localization cannot be described as $\hat\pi^\mu_n = S^N C_n M P \hat\pi^\mu_{n-1}$, since the measure $M P \hat\pi^\mu_{n-1}$ does not exist). The resulting algorithm is immediately given as in Figure 5.2 upon truncating the weights as we just mentioned.
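The truncation of the weights to a ball of radius $b$ can be illustrated as follows. This is a sketch under assumptions that are not part of the text: the sites form a path graph $\{0,\ldots,d-1\}$ with $d(v,w) = |v-w|$, and the transition density at site $w$ is a hypothetical nearest-neighbour Gaussian. The point of the example is only that the number of factors entering the weight is card $N_b(v) - 1$, independently of the total number of sites.

```python
import math


def neighborhood(v, b, num_sites):
    """N_b(v) = {w : |v - w| <= b} on the path graph {0, ..., num_sites - 1}."""
    return [w for w in range(num_sites) if abs(v - w) <= b]


def log_transition(z, x_w, w, sigma=1.0):
    """log p^w(z, x^w) for a hypothetical nearest-neighbour Gaussian model:
    x^w is Gaussian around the average of z over the sites within distance 1 of w."""
    lo, hi = max(0, w - 1), min(len(z) - 1, w + 1)
    mean = sum(z[lo:hi + 1]) / (hi - lo + 1)
    return -0.5 * ((x_w - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


def localized_log_weight(z, x, v, b, log_Z_v=0.0):
    """Unnormalized log-weight: log Z^v(z) plus the sum of log p^w(z, x^w)
    over w in N_b(v) with w != v."""
    return log_Z_v + sum(
        log_transition(z, x[w], w)
        for w in neighborhood(v, b, len(x))
        if w != v
    )
```

With identical local values, `localized_log_weight` returns the same number for 10 sites and for 1000 sites, whereas the untruncated product would pick up one extra factor per additional site.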


5.5  Localized  Gibbs  sampler  particle  filter 

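We now introduce a more convenient description of the localized Gibbs sampler particle filter. For each probability measure $\rho$ on $\mathbb X$ and each $n \ge 1$, $v \in V$, define the probability kernels $\eta^v_{n,\rho}$ and $\tilde\eta^v_{n,\rho}$ from $\mathbb X$ to $\mathbb X^v$ respectively as:

$$\eta^v_{n,\rho}(x, A) := \frac{\int \rho(dz) \prod_{w \in V\setminus\{v\}} p^w(z, x^w)\, p^v(z,\omega)\, \psi^v(d\omega)\, g^v(\omega, Y_n^v)\, 1_A(\omega)}{\int \rho(dz) \prod_{w \in V\setminus\{v\}} p^w(z, x^w)\, p^v(z,\omega)\, \psi^v(d\omega)\, g^v(\omega, Y_n^v)},$$

$$\tilde\eta^v_{n,\rho}(x, A) := \frac{\int \rho(dz) \prod_{w \in N_b(v)\setminus\{v\}} p^w(z, x^w)\, p^v(z,\omega)\, \psi^v(d\omega)\, g^v(\omega, Y_n^v)\, 1_A(\omega)}{\int \rho(dz) \prod_{w \in N_b(v)\setminus\{v\}} p^w(z, x^w)\, p^v(z,\omega)\, \psi^v(d\omega)\, g^v(\omega, Y_n^v)}.$$

It is easy to verify that

$$\eta^v_{n,\rho}(x, A) = (C_n P \rho)(X^v \in A \,|\, X^{V\setminus\{v\}} = x^{V\setminus\{v\}}) = \mathbf P^\rho(X_1^v \in A \,|\, Y_1 = Y_n,\ X_1^{V\setminus\{v\}} = x^{V\setminus\{v\}}),$$

while

$$\tilde\eta^v_{n,\rho}(x, A) = \mathbf P^\rho(X_1^v \in A \,|\, Y_1^{N_b(v)} = Y_n^{N_b(v)},\ X_1^{N_b(v)\setminus\{v\}} = x^{N_b(v)\setminus\{v\}})$$

corresponds to the localized quantity (5.1). Let us also define the probability kernels $G^v_{n,\rho}$ and $\tilde G^v_{n,\rho}$ from $\mathbb X$ to $\mathbb X$ respectively as

$$G^v_{n,\rho}(x, A) := \int \eta^v_{n,\rho}(x, d\omega^v)\, \delta_{x^{V\setminus\{v\}}}(d\omega^{V\setminus\{v\}})\, 1_A(\omega),$$

$$\tilde G^v_{n,\rho}(x, A) := \int \tilde\eta^v_{n,\rho}(x, d\omega^v)\, \delta_{x^{V\setminus\{v\}}}(d\omega^{V\setminus\{v\}})\, 1_A(\omega),$$

and the operators on probability measures

$$(\mathsf G^v_{n,\rho}\nu) f := \int \nu(dx)\, G^v_{n,\rho}(x, dx')\, f(x'),$$

$$(\tilde{\mathsf G}^v_{n,\rho}\nu) f := \int \nu(dx)\, \tilde G^v_{n,\rho}(x, dx')\, f(x').$$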

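From the definition of the Gibbs sampler sampling operator (Definition 5.1) we can write

$$S^{N,m} C_n P \rho = S^N \rho\, (G^{v_1}_{n,\rho} \cdots G^{v_d}_{n,\rho})^m = S^N (\mathsf G^{v_d}_{n,\rho} \cdots \mathsf G^{v_1}_{n,\rho})^m \rho.$$

Therefore, the Gibbs sampler particle filter can be formulated as

$$\hat\pi^\mu_0 := \mu, \qquad \hat\pi^\mu_n := S^N (\mathsf G^{v_d}_{n,\hat\pi^\mu_{n-1}} \cdots \mathsf G^{v_1}_{n,\hat\pi^\mu_{n-1}})^m\, \hat\pi^\mu_{n-1}.$$

At this point it is straightforward to describe the localization procedure previously discussed and to define the localized Gibbs sampler particle filter as

$$\hat\pi^\mu_0 := \mu, \qquad \hat\pi^\mu_n := S^N (\tilde{\mathsf G}^{v_d}_{n,\hat\pi^\mu_{n-1}} \cdots \tilde{\mathsf G}^{v_1}_{n,\hat\pi^\mu_{n-1}})^m\, \hat\pi^\mu_{n-1}.$$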

In the special case $b = \max_{v,v' \in V} d(v,v')$ the localized Gibbs sampler particle filter reduces to the Gibbs sampler particle filter, so that the former is a strict generalization of the latter (we have therefore not introduced a separate notation for the localized Gibbs sampler particle filter: in the remainder of this chapter, the notation $\hat\pi^\mu_n$ always refers to the localized Gibbs sampler particle filter).

Before moving to the analysis of the localized Gibbs sampler particle filter in the next section, we now prove that under the mixing conditions $\varepsilon \le p^v \le \varepsilon^{-1}$ for $0 < \varepsilon < 1$ the Gibbs sampler particle filter recursion $\hat\pi^\mu_n = S^{N,m} C_n P \hat\pi^\mu_{n-1}$ converges in the limit of infinitely many sweeps ($m \to \infty$) to the "optimal" SIR particle filter recursion $\hat\pi^\mu_n = S^N C_n P \hat\pi^\mu_{n-1}$ (cf. Remark 3.13). Recall that $F_n := C_n P$.

Lemma 5.3 (Convergence of Gibbs sampler particle filter). Suppose there exists $0 < \varepsilon < 1$ such that

$$\varepsilon \le p^v(x, z^v) \le \varepsilon^{-1}, \qquad \forall\, v \in V,\ x, z \in \mathbb X.$$

Then, for all probability measures $\rho, \mu$ on $\mathbb X$ and each $n \ge 1$ we have

$$\lim_{\ell \to \infty} \| F_n\rho - \mu (G^{v_1}_{n,\rho} \cdots G^{v_d}_{n,\rho})^\ell \| = 0.$$

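Proof. From the local mixing conditions for each $p^v$ we get the following minorization condition for each $\eta^v_{n,\rho}$,

$$\eta^v_{n,\rho}(x, A) \ge \varepsilon^{2(d-1)} \chi^v_n(A),$$

where

$$\chi^v_n(A) := \frac{\int \rho(dz)\, p^v(z,\omega)\, \psi^v(d\omega)\, g^v(\omega, Y_n^v)\, 1_A(\omega)}{\int \rho(dz)\, p^v(z,\omega)\, \psi^v(d\omega)\, g^v(\omega, Y_n^v)}.$$

Hence, applying this bound once for each of the $d$ coordinates, we also get the following minorization condition for an entire sweep of the Gibbs sampler,

$$(G^{v_1}_{n,\rho} \cdots G^{v_d}_{n,\rho})(x, A) \ge \varepsilon^{2d(d-1)} \chi(A),$$

where $\chi(A) := \int \bigotimes_{v \in V} \chi^v_n(dz^v)\, 1_A(z)$. As by construction for each $v \in V$ the kernel $G^v_{n,\rho}$ leaves invariant the measure $F_n\rho$, that is,

$$(F_n\rho)\, G^v_{n,\rho} = F_n\rho,$$

then by Lemma 2.10 we have

$$\| F_n\rho - \mu (G^{v_1}_{n,\rho} \cdots G^{v_d}_{n,\rho})^\ell \| = \| (F_n\rho)(G^{v_1}_{n,\rho} \cdots G^{v_d}_{n,\rho})^\ell - \mu (G^{v_1}_{n,\rho} \cdots G^{v_d}_{n,\rho})^\ell \| \le (1 - \varepsilon^{2d(d-1)})^\ell\, \| F_n\rho - \mu \|,$$

which vanishes as $\ell \to \infty$.

□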
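The mechanism used in the proof (a minorization condition forces geometric convergence to the invariant measure) can be checked numerically on a finite state space. The sketch below is a generic illustration and not the filtering model of the text: `K` is an arbitrary row-stochastic matrix, and the minorization constant is taken to be $\delta = \sum_j \min_i K_{ij}$, for which the standard bound on the total variation distance after $\ell$ steps is $(1-\delta)^\ell$ times the initial distance. All names are hypothetical.

```python
def apply_kernel(mu, K):
    """Row vector times stochastic matrix: (mu K)(j) = sum_i mu(i) K[i][j]."""
    return [sum(mu[i] * K[i][j] for i in range(len(mu))) for j in range(len(K[0]))]


def total_variation(mu, nu):
    """Total variation distance (1/2) * sum_j |mu(j) - nu(j)|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))


def minorization_constant(K):
    """delta = sum_j min_i K[i][j], so that K(i, .) >= delta * chi(.) for all i,
    with chi proportional to the column-wise minima."""
    return sum(min(row[j] for row in K) for j in range(len(K[0])))


def contraction_trace(mu, nu, K, steps):
    """Total variation distances between mu K^l and nu K^l for l = 0, ..., steps."""
    trace = [total_variation(mu, nu)]
    for _ in range(steps):
        mu, nu = apply_kernel(mu, K), apply_kernel(nu, K)
        trace.append(total_variation(mu, nu))
    return trace
```

Taking `nu` to be the invariant measure of `K` mirrors the argument in the proof: the invariant measure is unchanged by each step, so the trace bounds the distance of `mu K^l` from it.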


5.6 Main result: spatially-homogeneous error bound


Ultimately, we would like to mimic the result of Theorem 4.2 for the localized Gibbs sampler particle filter, and to prove a bound for $|||\pi^\mu_n - \hat\pi^\mu_n|||_J$ that is uniform both in time ($n$) and in the model dimension (card V), and that is spatially homogeneous in $J \subseteq V$. Although at present we do not have such a result, Theorem 5.4 below represents an encouraging first step towards establishing it.

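Following the strategy pursued in Chapter 4, we define the approximate Gibbs sampler filter as

$$\tilde\pi^\mu_0 := \mu, \qquad \tilde\pi^\mu_n := \tilde\pi^\mu_{n-1} (\tilde G^{v_1}_{n,\tilde\pi^\mu_{n-1}} \cdots \tilde G^{v_d}_{n,\tilde\pi^\mu_{n-1}})^m,$$

and we consider the following error decomposition (cf. Section 4.5.1)

$$|||\pi^\mu_n - \hat\pi^\mu_n|||_J \le \underbrace{|||\pi^\mu_n - \tilde\pi^\mu_n|||_J}_{\text{bias}} + \underbrace{|||\tilde\pi^\mu_n - \hat\pi^\mu_n|||_J}_{\text{variance}}.$$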

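Recall the following definitions from Chapter 4 and Appendix A. For any probability measure $\mu$ on $\mathbb X$ and $x, z \in \mathbb X$, $v, v' \in V$, $\beta > 0$, let

$$\Delta := \max_{v \in V} \mathrm{card}\{v' \in V : d(v,v') \le r\},$$

$$\mu^v_x(dz^v) := \mu(dz^v \,|\, x^{V\setminus\{v\}}),$$

$$\mu^v_{x,z}(A) := \frac{\int 1_A(x^v) \prod_{w \in N(v)} p^w(x, z^w)\, \mu^v_x(dx^v)}{\int \prod_{w \in N(v)} p^w(x, z^w)\, \mu^v_x(dx^v)},$$

$$C^\mu_{vv'} := \frac{1}{2} \sup_{z \in \mathbb X}\ \sup_{x, x' \in \mathbb X :\, x^{V\setminus\{v'\}} = x'^{V\setminus\{v'\}}} \| \mu^v_{x,z} - \mu^v_{x',z} \|,$$

$$\mathrm{Corr}(\mu, \beta) := \max_{v \in V} \sum_{v' \in V} e^{\beta d(v,v')}\, C^\mu_{vv'}.$$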

The following result provides a spatially homogeneous one-step error bound for the bias term of the localized Gibbs sampler particle filter.

Theorem 5.4 (Localized Gibbs sampler particle filter, one-step error for the bias). There exists a constant $0 < \varepsilon_0 < 1$ depending only on the local quantity $\Delta$ such that the following holds.

Suppose there exists $\varepsilon_0 < \varepsilon < 1$ such that

$$\varepsilon \le p^v(x, z^v) \le \varepsilon^{-1} \quad \text{for all } v \in V,\ x, z \in \mathbb X,$$

and let $\rho$ be a probability measure on $\mathbb X$ such that

Corr(p,  /3)  < 

where  (3  =  ^  log  •  Then,  for  each  n  >  1  and  J  C  V  we  have 

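$$\| F_n\rho - \tilde F_n\rho \|_J \le \alpha\, \mathrm{card}\, J\; e^{-\gamma \min\{b, m\}},$$
where the constants $0 < \alpha, \gamma < \infty$ depend only on $\varepsilon$, $r$, and $\Delta$.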

We  refer  to  Appendix  B  for  the  proof  of  Theorem  5.4.  While  in  the  case  of  the 
block  particle  filter  the  key  insight  to  perform  the  analysis  is  that  both  filter  and 
approximate  block  filter  can  be  thought  of  as  Gibbs  measures  on  properly-defined 
graphs  (see  Section  4.5.2  and  Remark  4.9),  in  the  present  case  the  key  insight  is  that 
both  filter  and  approximate  Gibbs  sampler  filter  can  be  thought  of  as  Gibbs  samplers. 

In fact, as by construction for each $v \in V$ the kernel $G^v_{n,\rho}$ leaves the measure $F_n\rho$ invariant (that is, $(F_n\rho)\, G^v_{n,\rho} = F_n\rho$), we can express the filter recursion as $m$ sweeps of a Gibbs sampler, namely,

$$F_n\rho = (F_n\rho)(G^{v_1}_{n,\rho} \cdots G^{v_d}_{n,\rho})^m.$$

On the other hand, the approximate Gibbs sampler filter recursion is defined as

$$\tilde F_n\rho := \rho\, (\tilde G^{v_1}_{n,\rho} \cdots \tilde G^{v_d}_{n,\rho})^m.$$

The key idea to bound $\| F_n\rho - \tilde F_n\rho \|_J$ is then to use the one-sided Dobrushin comparison theorem (Theorem 2.12) to capture the one-sidedness that is embedded in the Gibbs samplers $F_n\rho$ and $\tilde F_n\rho$.

Remark 5.5 (On running the algorithm). The one-step error bound in Theorem 5.4 suggests that in practice we only need to run the localized Gibbs sampler particle filter for a number of sweeps ($m$) that is of the same order as the radius ($b$) at which we implement the localization. From the analysis of the block particle filter (see Section 4.4) we know that the optimal $b$ increases quite slowly with the number of particles $N$ ($b \sim \log^{1/q} N$ for the square lattice $V = \{-d,\ldots,d\}^q$), which suggests that each iteration of the algorithm does not need to be run for many sweeps.
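To get a feeling for how slowly the scaling $b \sim \log^{1/q} N$ grows, the following small helper evaluates it for given $N$ and lattice dimension $q$. This is only a back-of-the-envelope sketch: the proportionality constant `c` is a placeholder that is not specified in the text, and the function name is hypothetical.

```python
import math


def localization_radius(num_particles, q, c=1.0):
    """Evaluate c * (log N)^(1/q); the constant c is an unspecified placeholder."""
    return c * math.log(num_particles) ** (1.0 / q)
```

For instance, with `c = 1` and `q = 2`, going from a thousand to a million particles increases the value from about 2.6 to about 3.7.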


5.7  Where  things  stand 

Theorem 5.4 yields a bound on the one-step error $\| F_n\rho - \tilde F_n\rho \|_J$ under a certain assumption on the decay of correlations for the measure $\rho$. In order to use this result within the general error decomposition scheme pursued in Section 4.5.3 to bound the bias term $|||\pi^\mu_n - \tilde\pi^\mu_n|||_J$, we need to prove that the appropriate decay of correlations property does in fact hold, uniformly in time, for the approximate filter $\tilde\pi^\mu_n$. That is, we would like to prove that

$$\sup_{n \ge 0} \mathrm{Corr}(\tilde\pi^\mu_n, \beta) \le c < 1,$$

where $c$ is an absolute constant which does not depend on the ambient dimension.

In the case of the block particle filter we can show this property by iterating a one-step decay of correlations bound that is obtained using the Dobrushin comparison theorem (see Section A.3). In the case of the localized Gibbs sampler particle filter, however, the situation is more involved as we need to control the way the decay of correlations is propagated in each iteration of the Gibbs samplers. While we have strong reasons to believe that the decay of correlations of the approximate filter should hold uniformly in time, at present we have not been successful in establishing the required behavior using the Dobrushin comparison method, and new mathematical tools seem to be needed.

To see why we expect the decay of correlations property to hold, consider the case of the filter recursion. While the Dobrushin comparison theorem can be used to bound the quantity $\mathrm{Corr}(F_n\rho, \beta)$ by making an assumption on $\mathrm{Corr}(\rho, \beta)$ (as done in Section A.3, see Proposition A.9 in particular), it does not seem possible to use the same machinery to bound $\mathrm{Corr}(\rho\, (G^{v_1}_{n,\rho} \cdots G^{v_d}_{n,\rho})^\ell, \beta)$, for any given finite $\ell$, without making higher-order assumptions on the decay of correlations of $\rho$, although we know that $\lim_{\ell \to \infty} \rho\, (G^{v_1}_{n,\rho} \cdots G^{v_d}_{n,\rho})^\ell = F_n\rho$ as seen in Lemma 5.3.


The approach that we have presented to bound the bias of the localized Gibbs sampler particle filter is taken from the analysis of the block particle filter given in Chapter 4, and it is based on the recursive property of the filter. On the other hand, the improved analysis of the block particle filter that will be given in the next chapter (Section 6.4) is based on another strategy that allows one to apply the Dobrushin comparison theorem directly to properly-defined space-time Gibbs measures, without considering the filter recursion. This new method yields a shorter proof for the bound of the bias term that does not involve controlling the decay of correlations quantity $\mathrm{Corr}(\tilde\pi^\mu_n, \beta)$. This approach relies on the ability to express both filter and approximate filter as the marginal of properly-defined Markov random fields, where the natural interaction range of the system (recall that the models introduced in Section 4.1 have an interaction of range $r$) can be recovered through the interaction neighborhood of the field (cf. the discussion on the Dobrushin comparison method in Section 4.5.2).

The problem in implementing this approach in the case of the localized Gibbs sampler particle filter lies in the fact that this algorithm is defined in terms of a recursion that does not seem to admit an intrinsic probabilistic interpretation that would allow one to recover the natural interaction range of the model. Ultimately, the problem is that this algorithm is defined in terms of conditional probabilities that do not have a local structure (see Section 5.5). In fact, by definition, $\tilde\eta^v_{n,\rho}(x, A)$ depends on $x^w$ whenever $d(v,w) \le b$. Therefore, even if we can interpret the measure $\tilde\pi^\mu_n$ as the marginal of a properly-defined space-time Gibbs measure, this measure is a Markov random field with interaction neighborhood size $b$, which does not correspond to the intrinsic neighborhood size $r$.

Chapter 6

Comparison theorems for Gibbs measures


This chapter is devoted to establishing new comparison theorems for Gibbs measures
that substantially extend the range of applicability of the classical Dobrushin
comparison  theorem,  the  main  tool  behind  the  proofs  of  the  results  presented  in  the 
previous  two  chapters  for  the  analysis  of  filtering  algorithms  in  high  dimension.  The 
novel  toolbox  will  be  used  to  extend  the  analysis  of  the  block  particle  filter  given  in 
Chapter  4  to  the  case  where  spatial  and  temporal  ergodicity  are  treated  on  a  different 
footing.  This  chapter  is  based  on  the  paper  [  ]. 


6.1  Motivations 

The analysis of the block particle filter in Chapter 4 and the analysis of the localized Gibbs sampler particle filter in Chapter 5 rely heavily on the Dobrushin comparison theorem introduced in Section 2.4, which is a powerful tool to obtain dimension-free estimates on the difference between the marginals of Gibbs measures $\rho$ and $\tilde\rho$ in terms of the single site conditional distributions $\rho(X^j \in dx^j \,|\, X^{I\setminus\{j\}} = x^{I\setminus\{j\}})$ and $\tilde\rho(X^j \in dx^j \,|\, X^{I\setminus\{j\}} = x^{I\setminus\{j\}})$.

In order to ensure decay of correlations, Theorem 4.2 and Theorem 5.4 (the main results of Chapter 4 and Chapter 5, respectively) impose a weak interactions assumption ($\varepsilon \le p^v \le \varepsilon^{-1}$ for $\varepsilon > \varepsilon_0$) that is dictated by the comparison theorem. As explained in Section 4.4.2, this assumption is unsatisfactory already at the qualitative level: it limits not only the spatial interactions (as is needed to ensure decay of correlations) but also the dynamics in time. Overcoming this unnatural restriction requires a generalized version of the comparison theorem, which is one of the main motivations for the results developed in this chapter.

More generally, aside from the filtering framework considered in the previous chapters, the Dobrushin comparison theorem has proved to be useful to establish numerous properties of Gibbs measures, including uniqueness, decay of correlations, global Markov properties, and analyticity [27, 45, 25], as well as functional inequalities and concentration of measure properties [29, 32, 67].

Despite this broad array of applications, the range of applicability of the Dobrushin comparison theorem proves to be somewhat limited. This can already be seen in the easiest qualitative consequence of this result: the comparison theorem implies uniqueness of the Gibbs measure under the well-known Dobrushin uniqueness criterion [18]. Unfortunately, this criterion is restrictive: even in models where uniqueness can be established by explicit computation, the Dobrushin uniqueness criterion holds only in a small subset of the natural parameter space (see, e.g., [64] for examples). This suggests that the Dobrushin comparison theorem is a rather blunt tool. On the other hand, it is also known that the Dobrushin uniqueness criterion can be substantially improved: this was accomplished in Dobrushin and Shlosman [17] by considering a local description in terms of larger blocks $\rho(X^J \in dx^J \,|\, X^{I\setminus J} = x^{I\setminus J})$ instead of the single site specification $\rho(X^j \in dx^j \,|\, X^{I\setminus\{j\}} = x^{I\setminus\{j\}})$; in this manner, it is possible in many cases to capture a large part of or even the entire uniqueness region. The uniqueness results of Dobrushin and Shlosman were further generalized by Weitz [64], who developed remarkably general combinatorial criteria for uniqueness. However, while the proofs of Dobrushin-Shlosman and Weitz also provide some information on decay of correlations, they do not provide an analogue of the powerful general-purpose machinery that the Dobrushin comparison theorem yields in its more restrictive setting.

The general aim of the present chapter is to fill this gap. Our main results (Theorem 6.4 and Theorem 6.12) provide a direct generalization of the Dobrushin comparison theorem to the much more general setting considered by Weitz [64], substantially extending the range of applicability of the classical comparison theorem.

While  the  original  comparison  theorem  is  an  immediate  consequence  of  our  main 
result  (Corollary  6.6),  the  classical  proof  that  is  based  on  the  “method  of  estimates” 
does  not  appear  to  extend  easily  beyond  the  single  site  setting.  We  therefore  develop 
a  different,  though  certainly  related,  method  of  proof  that  systematically  exploits  the 
connection  of  Markov  chains.  In  particular,  our  main  results  are  derived  from  a  more 
general  comparison  theorem  for  Markov  chains  that  is  applied  to  a  suitably  defined 
family  of  Gibbs  samplers.  The  proofs  of  the  new  comparison  theorems  are  contained 
in  Appendix  C,  Sections  C.1-C.5. 

As  an  application  of  the  generalized  comparison  theorems,  in  Section  6.4  we 
present  an  improved  analysis  of  the  block  particle  filter  introduced  in  Chapter  4. 
The  proof  of  this  result  is  provided  in  Appendix  C,  Section  C.6. 

6.2  Setting  and  notation 

We  begin  by  introducing  the  basic  setting  that  will  be  used  throughout  this  section. 

Sites  and  configurations 

Let $I$ be a finite or countably infinite set of sites. Each subset $J \subseteq I$ is called a region; the set of finite regions will be denoted as

$$\mathcal I := \{J \subseteq I : \mathrm{card}\, J < \infty\}.$$

To each site $i \in I$ is associated a measurable space $\mathbb S^i$, the local state space. A configuration is an assignment $x_i \in \mathbb S^i$ to each site $i \in I$. The set of all configurations $\mathbb S$, and the set $\mathbb S^J$ of configurations in a given region $J \subseteq I$, are defined as

$$\mathbb S := \prod_{i \in I} \mathbb S^i, \qquad \mathbb S^J := \prod_{i \in J} \mathbb S^i.$$

For $x = (x_i)_{i \in I} \in \mathbb S$, we denote by $x^J := (x_i)_{i \in J} \in \mathbb S^J$ the natural projection on $\mathbb S^J$. When $J \cap K = \varnothing$, we define $z = x^J y^K \in \mathbb S^{J \cup K}$ such that $z^J = x^J$ and $z^K = y^K$.
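The projection and gluing notation can be made concrete with a small data-structure sketch. The representation chosen here (a configuration as a dictionary from sites to local states) and the helper names `project` and `glue` are illustrative assumptions, not part of the text.

```python
def project(x, J):
    """x^J: restriction of the configuration x (dict site -> state) to the region J."""
    return {i: x[i] for i in J}


def glue(xJ, yK):
    """z = x^J y^K for disjoint regions J and K, so that z^J = x^J and z^K = y^K."""
    if set(xJ) & set(yK):
        raise ValueError("regions must be disjoint")
    z = dict(xJ)
    z.update(yK)
    return z
```

For example, gluing `project(x, {0, 1})` with `project(y, {2})` produces a configuration on the sites `{0, 1, 2}` that agrees with `x` on the first two sites and with `y` on the third.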


Local  functions 

A function $f : \mathbb S \to \mathbb R$ is said to be $J$-local if $f(x) = f(z)$ whenever $x^J = z^J$, that is, if $f(x)$ depends on $x^J$ only. The function $f$ is said to be local if it is $J$-local for some finite region $J \in \mathcal I$. When $I$ is a finite set, every function is local. When $I$ is infinite, however, we will frequently restrict attention to local functions. More generally, we will consider a class of "nearly" local functions to be defined presently.

Given any function $f : \mathbb S \to \mathbb R$, let us define for $J \in \mathcal I$ and $x \in \mathbb S$ the $J$-local function

$$f^J_x(z) := f(z^J x^{I\setminus J}).$$

Then $f$ is called quasilocal if it can be approximated pointwise by the local functions $f^J_x$:

$$\lim_{J \in \mathcal I} | f^J_x(z) - f(z) | = 0 \quad \text{for all } x, z \in \mathbb S,$$

where $\lim_{J \in \mathcal I} a_J$ denotes the limit of the net $(a_J)_{J \in \mathcal I}$ where $\mathcal I$ is directed by inclusion $\subseteq$ (equivalently, $a_J \to 0$ if and only if $a_{J_i} \to 0$ for every sequence $J_1, J_2, \ldots \in \mathcal I$ such that $J_1 \subseteq J_2 \subseteq \cdots$ and $\bigcup_i J_i = I$). Let us note that this notion is slightly weaker than the conventional notion of quasilocality used, for example, in [27].


Metrics 


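In the sequel, we fix for each $i \in I$ a metric $\eta_i$ on $\mathbb S^i$ (we assume throughout that $\eta_i$ is measurable as a function on $\mathbb S^i \times \mathbb S^i$). We will write $\|\eta_i\| = \sup_{x,z} \eta_i(x,z)$.

Given a function $f : \mathbb S \to \mathbb R$ and $i \in I$, we define

$$\mathrm{osc}_i f := \sup_{x, z \in \mathbb S :\, x^{I\setminus\{i\}} = z^{I\setminus\{i\}}} \frac{| f(x) - f(z) |}{\eta_i(x_i, z_i)}.$$

The quantity $\mathrm{osc}_i f$ measures the variability of $f(x)$ with respect to the variable $x_i$.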
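For a finite set of sites and finite local state spaces, the oscillation at a site can be computed by brute force. The sketch below assumes, purely for illustration, binary local states and the discrete metric; the function names are hypothetical, and pairs with $x_i = z_i$ are skipped since they contribute nothing to the supremum.

```python
from itertools import product


def discrete_metric(a, b):
    """eta(a, b) = 1 if a != b and 0 otherwise."""
    return 0.0 if a == b else 1.0


def osc(f, i, num_sites, states=(0, 1), metric=discrete_metric):
    """Supremum of |f(x) - f(z)| / eta_i(x_i, z_i) over configurations x, z
    (tuples of length num_sites) that differ only at site i."""
    best = 0.0
    for x in product(states, repeat=num_sites):
        for s in states:
            if s == x[i]:
                continue
            z = x[:i] + (s,) + x[i + 1:]
            best = max(best, abs(f(x) - f(z)) / metric(x[i], s))
    return best
```

For the function `f(x) = x[0] + 2 * x[1]` on three binary sites, the oscillation is 1 at site 0, 2 at site 1, and 0 at site 2, reflecting how strongly `f` depends on each coordinate.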


Matrices 

The calculus of possibly infinite nonnegative matrices will appear repeatedly in the sequel. Given matrices $A = (A_{ij})_{i,j \in I}$ and $B = (B_{ij})_{i,j \in I}$ with nonnegative entries $A_{ij} \ge 0$ and $B_{ij} \ge 0$, the matrix product is defined as usual by

$$(AB)_{ij} := \sum_{k \in I} A_{ik} B_{kj}.$$

This quantity is well defined as the terms in the sum are all nonnegative, but $(AB)_{ij}$ may possibly take the value $+\infty$. As long as we consider only nonnegative matrices, all the usual rules of matrix multiplication extend to infinite matrices provided that we allow entries with the value $+\infty$ and that we use the convention $+\infty \cdot 0 = 0$ (this follows from the Fubini-Tonelli theorem, cf. [20, Chapter 4]). In particular, the matrix powers $A^k$, $k \ge 1$ are well defined, and we define $A^0 = I$ where $I := (1_{i=j})_{i,j \in I}$ denotes the identity matrix. We will write $A < \infty$ if the nonnegative matrix $A$ satisfies $A_{ij} < \infty$ for every $i, j \in I$.
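The convention $+\infty \cdot 0 = 0$ needs to be implemented explicitly in floating-point arithmetic, where `inf * 0` evaluates to `nan`. The following sketch does this for finite index sets only (the infinite case treated in the text is not covered); the function names are hypothetical.

```python
INF = float("inf")


def nonneg_mul(a, b):
    """Product of extended nonnegative reals with the convention inf * 0 = 0."""
    if a == 0 or b == 0:
        return 0.0
    return a * b


def matmul(A, B):
    """(AB)_ij = sum_k A_ik B_kj for finite index sets and entries in [0, inf]."""
    return [
        [sum(nonneg_mul(A[i][k], B[k][j]) for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def matpow(A, k):
    """A^k for k >= 0, with A^0 equal to the identity matrix."""
    n = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, A)
    return result
```

Since all terms are nonnegative, the sums never involve `inf - inf`, so the only special case that has to be handled is the product of an infinite entry with a zero entry.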

Kernels,  covers,  local  structure 

Recall that a transition kernel $\gamma$ from a measurable space $(\Omega, \mathcal F)$ to a measurable space $(\Omega', \mathcal F')$ is a map $\gamma : \Omega \times \mathcal F' \to \mathbb R$ such that $\omega \mapsto \gamma_\omega(A)$ is a measurable function for each $A \in \mathcal F'$ and $\gamma_\omega(\cdot)$ is a probability measure for each $\omega \in \Omega$, cf. [ ]. Given a probability measure $\mu$ on $\Omega$ and function $f$ on $\Omega'$, we define as usual the probability measure $(\mu\gamma)(A) = \int \mu(d\omega)\, \gamma_\omega(A)$ on $\Omega'$ and function $(\gamma f)(\omega) = \int \gamma_\omega(d\omega')\, f(\omega')$ on $\Omega$.

A transition kernel $\gamma$ between product spaces is called quasilocal if $\gamma f$ is quasilocal for every bounded and measurable quasilocal function $f$.
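On finite spaces the two actions of a kernel reduce to sums, and the identity $(\mu\gamma) f = \mu(\gamma f)$ can be checked directly. The sketch below represents a measure as a dictionary of point masses and a kernel as a dictionary of such measures; this representation and the function names are illustrative assumptions.

```python
def measure_times_kernel(mu, gamma):
    """(mu gamma)(a) = sum_w mu(w) * gamma[w][a]."""
    out = {}
    for w, mass in mu.items():
        for a, p in gamma[w].items():
            out[a] = out.get(a, 0.0) + mass * p
    return out


def kernel_times_function(gamma, f):
    """(gamma f)(w) = sum_a gamma[w][a] * f(a), returned as a dict over w."""
    return {w: sum(p * f(a) for a, p in row.items()) for w, row in gamma.items()}


def integrate(mu, values):
    """Integral of a function, given as a dict of values, against the measure mu."""
    return sum(mass * values[w] for w, mass in mu.items())
```

A usage example: with `mu = {"u": 0.25, "v": 0.75}`, a kernel mapping `"u"` to `{"a": 0.5, "b": 0.5}` and `"v"` to `{"a": 0.1, "b": 0.9}`, and a function taking the values 1 on `"a"` and 3 on `"b"`, both ways of computing the integral agree.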

Our interest throughout this chapter is in models of random configurations, described by a probability measure $\mu$ on $\mathbb S$. We would like to understand the properties of such models based on their local structure. A natural way to express the local structure in a finite region $J \in \mathcal I$ is to consider the conditional distribution $\gamma^J_x(dz^J) = \mu(X^J \in dz^J \,|\, X^{I\setminus J} = x^{I\setminus J})$ of the configuration in $J$ given a fixed configuration $x^{I\setminus J}$ for the sites outside $J$: conceptually, $\gamma^J$ describes how the sites in $J$ "interact" with the sites outside $J$. The conditional distribution $\gamma^J$ is a transition kernel from $\mathbb S$ to $\mathbb S^J$. To obtain a complete local description of the model, we must consider a class of finite regions $J$ that covers the entire set of sites $I$. Let us call a collection of regions $\mathcal J \subseteq \mathcal I$ a cover of $I$ if every site $i \in I$ is contained in at least one element of $\mathcal J$ (note that, by definition, a cover contains only finite regions). Given any cover $\mathcal J$, the collection $(\gamma^J)_{J \in \mathcal J}$ provides a local description of the model.

In fact, our main results will hold in a somewhat more general setting than is described above. Let $\mu$ be a probability measure on $\mathbb S$ and $\gamma^J$ be a transition kernel from $\mathbb S$ to $\mathbb S^J$. We say that $\mu$ is $\gamma^J$-invariant if for every bounded measurable function $f$

$$\int \mu(dx)\, f(x) = \int \mu(dx)\, \gamma^J_x(dz^J)\, f(z^J x^{I\setminus J});$$

by a slight abuse of notation, we will also write $\mu f = \mu\gamma^J f$. This means that if the configuration $x$ is drawn according to $\mu$, then its distribution is left unchanged if we replace the configuration $x^J$ inside the region $J$ by a random sample from the distribution $\gamma^J_x$, keeping the configuration $x^{I\setminus J}$ outside $J$ fixed. Our main results will be formulated in terms of a collection of transition kernels $(\gamma^J)_{J \in \mathcal J}$ such that $\mathcal J$ is a cover of $I$ and such that $\mu$ is $\gamma^J$-invariant for every $J \in \mathcal J$. If we choose $\gamma^J_x(dz^J) = \mu(X^J \in dz^J \,|\, X^{I\setminus J} = x^{I\setminus J})$ as above, then the $\gamma^J$-invariance of $\mu$ holds by construction [31, Theorem 6.4]; however, any family of $\gamma^J$-invariant kernels will suffice for the validity of our main results.

Remark 6.1 (Gibbs measures and specifications). The idea that the collection $(\gamma^J)_{J \in \mathcal J}$ provides a natural description of high-dimensional probability distributions is prevalent in many applications. In fact, in statistical mechanics, the model is usually defined in terms of such a family. To this end, one fixes a priori a family of transition kernels $(\gamma^J)_{J \in \mathcal J}$, called a specification, that describes the local structure of the model. The definition of $\gamma^J$ is done directly in terms of the parameters of the problem (the potentials that define the physical interactions, or the local constraints that define the combinatorial structure). A measure $\mu$ on $\mathbb S$ is called a Gibbs measure for the given specification if $\mu(X^J \in dz^J \,|\, X^{I\setminus J} = x^{I\setminus J}) = \gamma^J_x(dz^J)$ for every $J \in \mathcal J$. The existence of a Gibbs measure allows one to define the model $\mu$ in terms of the specification. It may happen that there are multiple Gibbs measures for the same specification: the significance of this phenomenon is the presence of a phase transition, akin to the transition of water from liquid to solid at the freezing point. As the construction of Gibbs measures from specifications is not essential for the validity or applicability of our results, we omit further details. We refer to [27, 45, 64] for extensive discussion, examples, and references.
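The invariance of a measure under its own conditional distributions can be verified by direct enumeration when the configuration space is finite. The sketch below assumes a measure given as a dictionary from configurations (tuples) to probabilities; the helper names `conditional_kernel` and `resample` are hypothetical.

```python
def conditional_kernel(mu, J):
    """Conditional law under mu of the coordinates in J given the coordinates
    outside J. Returns a function x -> dict mapping z^J (tuple) to probability."""
    order = sorted(J)

    def gamma(x):
        match = {
            z: p
            for z, p in mu.items()
            if all(z[i] == x[i] for i in range(len(x)) if i not in J)
        }
        total = sum(match.values())
        return {tuple(z[i] for i in order): p / total for z, p in match.items()}

    return gamma


def resample(mu, gamma, J, f):
    """Sum over x and z^J of mu(x) * gamma_x(z^J) * f(z^J glued with x outside J)."""
    order = sorted(J)
    acc = 0.0
    for x, p in mu.items():
        for zJ, q in gamma(x).items():
            z = list(x)
            for pos, i in enumerate(order):
                z[i] = zJ[pos]
            acc += p * q * f(tuple(z))
    return acc
```

For a measure on two binary sites, `resample(mu, conditional_kernel(mu, {0}), {0}, f)` coincides with the plain expectation of `f` under `mu`, which is the invariance property in this finite setting.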


6.3  General  comparison  theorem 

Let $\rho$ and $\tilde\rho$ be probability measures on the space of configurations $\mathbb S$. Our main result, Theorem 6.4 below, provides a powerful tool to obtain quantitative bounds on the difference between $\rho$ and $\tilde\rho$ in terms of their local structure. Before we can state our results, we must first introduce some basic notions. Our terminology is inspired by Weitz [64].

As was explained above, the local description of a probability measure $\rho$ on $\mathbb S$ will be provided in terms of a family of transition kernels. We formalize this as follows.

Definition 6.2. A local update rule for $\rho$ is a collection $(\gamma^J)_{J \in \mathcal J}$ where $\mathcal J$ is a cover of $I$, $\gamma^J$ is a transition kernel from $\mathbb S$ to $\mathbb S^J$ and $\rho$ is $\gamma^J$-invariant for every $J \in \mathcal J$.

In order to compare two measures $\rho$ and $\tilde\rho$ on the basis of their local update rules $(\gamma^J)_{J \in \mathcal J}$ and $(\tilde\gamma^J)_{J \in \mathcal J}$, we must quantify two separate effects. On the one hand, we must understand how the two models differ locally: that is, we must quantify how $\gamma^J_x$ and $\tilde\gamma^J_x$ differ when acting on the same configuration $x$. On the other hand, we must understand how perturbations to the local update rule in different regions interact: to this end, we will quantify the extent to which $\gamma^J_x$ and $\gamma^J_z$ differ for different configurations $x, z$. Both effects will be addressed by introducing a suitable family of couplings. Recall that a probability measure $Q$ on a product space $\Omega \times \Omega$ is called a coupling of probability measures $\mu, \nu$ on $\Omega$ if its marginals coincide with $\mu, \nu$, that is, $Q(\,\cdot \times \Omega) = \mu$ and $Q(\Omega \times \cdot\,) = \nu$.

Definition 6.3. A coupled update rule for $(\rho, \tilde\rho)$ is a collection $(\gamma^J, \tilde\gamma^J, Q^J, \hat Q^J)_{J\in\mathcal{J}}$,
where $\mathcal{J}$ is a cover of $I$, such that the following properties hold:

1. $(\gamma^J)_{J\in\mathcal{J}}$ and $(\tilde\gamma^J)_{J\in\mathcal{J}}$ are local update rules for $\rho$ and $\tilde\rho$, respectively.

2. $Q^J_{x,z}$ is a coupling of $\gamma^J_x, \gamma^J_z$ for every $J \in \mathcal{J}$ and $x, z \in \mathbb{S}$ with $\mathrm{card}\{i : x_i \ne z_i\} = 1$.

3. $\hat Q^J_x$ is a coupling of $\gamma^J_x, \tilde\gamma^J_x$ for every $J \in \mathcal{J}$ and $x \in \mathbb{S}$.

We can now state our main result. The proof will be given in Appendix C, Sections
C.1-C.3.


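Theorem 6.4 (General comparison theorem, main result). Let $\mathcal{J}$ be a cover of $I$,
let $(w_J)_{J\in\mathcal{J}}$ be a family of strictly positive weights, and let $(\gamma^J, \tilde\gamma^J, Q^J, \hat Q^J)_{J\in\mathcal{J}}$ be a
coupled update rule for $(\rho, \tilde\rho)$. Define for $i, j \in I$

$$W_{ij} := \mathbf{1}_{i=j} \sum_{J\in\mathcal{J}:\, i\in J} w_J,$$

$$R_{ij} := \sup_{\substack{x,z\in\mathbb{S}: \\ x^{I\setminus\{j\}} = z^{I\setminus\{j\}}}} \frac{1}{\eta_j(x_j, z_j)} \sum_{J\in\mathcal{J}:\, i\in J} w_J\, Q^J_{x,z}\eta_i,$$

$$a_j := \sum_{J\in\mathcal{J}:\, j\in J} w_J \int^* \tilde\rho(dx)\, \hat Q^J_x \eta_j.$$

Assume that $\gamma^J$ is quasilocal for every $J \in \mathcal{J}$, and that

$$W_{ii} \le 1 \quad\text{and}\quad \lim_{n\to\infty} \sum_{j\in I} (I - W + R)^n_{ij}\, (\rho \otimes \tilde\rho)\eta_j = 0 \quad\text{for all } i \in I. \tag{6.1}$$

Then we have

$$|\rho f - \tilde\rho f| \le \sum_{i,j\in I} \mathrm{osc}_i f\, D_{ij}\, W_{jj}^{-1} a_j \quad\text{where}\quad D := \sum_{n=0}^{\infty} (W^{-1}R)^n,$$

for any bounded and measurable quasilocal function $f$ such that $\mathrm{osc}_i f < \infty$ for all
$i \in I$.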


Remark 6.5. While it is essential in the proof that $\gamma^J$ and $\tilde\gamma^J$ are transition kernels,
we do not require that $Q^J$ and $\hat Q^J$ are transition kernels in Definition 6.3, that is,
the couplings $Q^J_{x,z}$ and $\hat Q^J_x$ need not be measurable as functions of $x, z$. It is for this
reason that the coefficients $a_j$ are defined in terms of an outer integral rather than an
ordinary integral [53]:

$$\int^* f(x)\, \rho(dx) := \inf\left\{ \int g(x)\, \rho(dx) : f \le g,\ g \text{ is measurable} \right\}.$$

When $x \mapsto \hat Q^J_x \eta_j$ is measurable this issue can be disregarded. In practice measurability
will hold in all but pathological cases, but may not always be trivial to prove.
We therefore allow for nonmeasurable couplings for the sake of technical convenience, so
that it is not necessary to check measurability of the coupled updates when applying
Theorem 6.4.
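
When the index set $I$ is finite, the right-hand side of the bound in Theorem 6.4 can be evaluated numerically. The following Python sketch illustrates this under that finiteness assumption; the function name `comparison_bound`, its arguments, and the numbers in the example are hypothetical and serve only as an illustration. It applies the Neumann series $D = \sum_n (W^{-1}R)^n$ to the vector $W^{-1}a$ and pairs the result with the oscillations $\mathrm{osc}_i f$.

```python
def comparison_bound(W_diag, R, a, osc, tol=1e-12, max_iter=10_000):
    """Evaluate sum_{i,j} osc[i] * D[i][j] * a[j] / W_diag[j] for finite I.

    W_diag : diagonal entries W_ii (strictly positive)
    R      : nonnegative influence matrix R[i][j]
    a      : nonnegative local error coefficients a_j
    osc    : oscillations osc_i f of the test function
    """
    n = len(W_diag)
    # M = W^{-1} R, i.e. row i of R divided by W_ii.
    M = [[R[i][j] / W_diag[i] for j in range(n)] for i in range(n)]
    # Start the Neumann series at the vector W^{-1} a.
    term = [a[j] / W_diag[j] for j in range(n)]
    total = term[:]
    for _ in range(max_iter):
        term = [sum(M[i][j] * term[j] for j in range(n)) for i in range(n)]
        total = [t + s for t, s in zip(total, term)]
        if max(term) < tol:  # all entries are nonnegative
            return sum(osc[i] * total[i] for i in range(n))
    raise ValueError("Neumann series did not converge; D may be infinite")


# Two sites that influence each other with strength 1/2, f depending on site 0.
bound = comparison_bound(
    W_diag=[1.0, 1.0],
    R=[[0.0, 0.5], [0.5, 0.0]],
    a=[0.1, 0.1],
    osc=[1.0, 0.0],
)
print(bound)  # approximately 0.2
```

In this toy example $D = (I - R)^{-1}$, so the local errors $a_j = 0.1$ are amplified by a factor of two at each site.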

We  will  presently  formulate  a  number  of  special  cases  and  extensions  of  Theorem 
6.4  that  may  be  useful  in  different  settings.  A  detailed  application  is  presented  in 
Section  6.4,  where  we  improve  the  analysis  of  the  block  particle  filter  given  in  Chapter 
4.

6.3.1  The  classical  comparison  theorem

The original comparison theorem of Dobrushin [18, Theorem 3] and its commonly
used formulation due to Föllmer [ ] (i.e., Theorem 2.11) correspond to the special
case of Theorem 6.4 where the cover $\mathcal{J} = \mathcal{J}_s := \{\{i\} : i \in I\}$ consists of single sites.
For example, the main result of [24] follows readily from Theorem 6.4 under a mild
regularity assumption. To formulate it, recall that the Wasserstein distance $d_\eta(\mu, \nu)$
between probability measures $\mu$ and $\nu$ on a measurable space $\Omega$ with respect to a
measurable metric $\eta$ is defined as

$$d_\eta(\mu, \nu) := \inf_{\substack{Q(\,\cdot\times\Omega) = \mu \\ Q(\Omega\times\cdot\,) = \nu}} Q\eta,$$

where the infimum is taken over probability measures $Q$ on $\Omega \times \Omega$ with the given
marginals $\mu$ and $\nu$. We now obtain the following classical result (cf. [24] and [25,
Remark 2.17]).

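Corollary 6.6 ([ ]). Assume $\mathbb{S}^i$ is Polish and $\eta_i$ is lower-semicontinuous for all
$i \in I$. Let $(\gamma^{\{i\}})_{i\in I}$ and $(\tilde\gamma^{\{i\}})_{i\in I}$ be local update rules for $\rho$ and $\tilde\rho$, respectively, and let

$$C_{ij} := \sup_{\substack{x,z\in\mathbb{S}: \\ x^{I\setminus\{j\}} = z^{I\setminus\{j\}}}} \frac{d_{\eta_i}(\gamma^{\{i\}}_x, \gamma^{\{i\}}_z)}{\eta_j(x_j, z_j)}, \qquad b_j := \int^* \tilde\rho(dx)\, d_{\eta_j}(\gamma^{\{j\}}_x, \tilde\gamma^{\{j\}}_x).$$

Assume that $\gamma^{\{i\}}$ is quasilocal for every $i \in I$, and that

$$\lim_{n\to\infty} \sum_{j\in I} C^n_{ij}\, (\rho \otimes \tilde\rho)\eta_j = 0 \quad\text{for all } i \in I.$$

Then we have

$$|\rho f - \tilde\rho f| \le \sum_{i,j\in I} \mathrm{osc}_i f\, D_{ij}\, b_j \quad\text{where}\quad D := \sum_{n=0}^{\infty} C^n,$$

for any bounded and measurable quasilocal function $f$ such that $\mathrm{osc}_i f < \infty$ for all
$i \in I$.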


If $Q^{\{i\}}_{x,z}$ and $\hat Q^{\{i\}}_x$ are minimizers in the definition of $d_{\eta_i}(\gamma^{\{i\}}_x, \gamma^{\{i\}}_z)$ and $d_{\eta_i}(\gamma^{\{i\}}_x, \tilde\gamma^{\{i\}}_x)$,
respectively, and if we let $\mathcal{J} = \mathcal{J}_s$ and $w_{\{i\}} = 1$ for all $i \in I$, then Corollary 6.6 follows
immediately from Theorem 6.4. For simplicity, we have imposed the mild topological
regularity assumption on $\mathbb{S}^i$ and $\eta_i$ to ensure the existence of minimizers [62, Theorem
4.1] (when minimizers do not exist, it is possible with some more work to obtain a
similar result by using near-optimal couplings in Theorem 6.4). Let us note that
when $\eta_i(x, z) = \mathbf{1}_{x \ne z}$ is the trivial metric, the Wasserstein distance reduces to the
total variation distance

$$d_\eta(\mu, \nu) = \frac{1}{2}\|\mu - \nu\| := \frac{1}{2} \sup_{f:\, \|f\| \le 1} |\mu f - \nu f| \quad\text{when } \eta(x, z) = \mathbf{1}_{x \ne z},$$

and an optimal coupling exists in any measurable space [18, p. 472]. Thus in this
case no regularity assumptions are needed, and Corollary 6.6 reduces to the textbook
version of the comparison theorem that appears, e.g., in [17, Theorem 8.20] or [ 5,
Theorem V.2.2].
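
For distributions on a finite set, an optimal coupling for the trivial metric can be written down explicitly. The Python sketch below is an illustration only: the helper names and the two example distributions are assumptions made for the example. It builds the standard maximal coupling, which keeps as much mass as possible on the diagonal, and checks that its cost $Q\eta = Q(x \ne z)$ equals half the $\ell^1$ distance between the two distributions.

```python
from fractions import Fraction as F


def total_variation(mu, nu):
    """Half the l1 distance between two pmfs given as dicts."""
    keys = set(mu) | set(nu)
    return sum(abs(mu.get(k, 0) - nu.get(k, 0)) for k in keys) / 2


def maximal_coupling(mu, nu):
    """Joint pmf Q[(a, b)] with marginals mu, nu and minimal Q(a != b)."""
    keys = sorted(set(mu) | set(nu))
    # Mass that both distributions agree on stays on the diagonal.
    overlap = {k: min(mu.get(k, 0), nu.get(k, 0)) for k in keys}
    excess = 1 - sum(overlap.values())
    Q = {(k, k): m for k, m in overlap.items() if m > 0}
    # The leftover masses are coupled independently off the diagonal.
    for a in keys:
        ra = mu.get(a, 0) - overlap[a]
        for b in keys:
            rb = nu.get(b, 0) - overlap[b]
            if ra > 0 and rb > 0:
                Q[(a, b)] = ra * rb / excess
    return Q


def mismatch(Q):
    """Cost of the coupling for the trivial metric: Q(a != b)."""
    return sum(q for (a, b), q in Q.items() if a != b)


mu = {0: F(1, 2), 1: F(1, 2)}
nu = {0: F(1, 4), 1: F(3, 4)}
Q = maximal_coupling(mu, nu)
print(mismatch(Q), total_variation(mu, nu))  # both equal 1/4
```

Exact rational arithmetic is used so that the equality between the coupling cost and the total variation distance can be checked without rounding.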

While  the  classical  comparison  theorem  of  Corollary  6.6  follows  from  our  main 
result, it should be emphasized that the single site assumption $\mathcal{J} = \mathcal{J}_s$ is a significant
restriction.  The  general  statement  of  Theorem  6.4  constitutes  a  crucial  improvement 
that  substantially  extends  the  range  of  applicability  of  the  comparison  method,  as 
the  application  to  the  block  particle  filter  demonstrates.  Let  us  also  note  that  the 
proofs  in  [18,  24],  based  on  the  “method  of  estimates,”  do  not  appear  to  extend  easily 
beyond  the  single  site  setting.  We  use  a  different  (though  related)  method  of  proof 
that  systematically  exploits  the  connection  with  Markov  chains  (Appendix  C). 

6.3.2  Alternative  assumptions 

The  key  assumption  of  Theorem  6.4  is  (6.1).  The  aim  of  the  present  section  is  to 
obtain  a  number  of  useful  alternatives  to  assumption  (6.1)  that  are  easily  verified  in 
practice. 

We begin by defining the notion of a tempered measure [25, Remark 2.17].

Definition 6.7. A probability measure $\mu$ on $\mathbb{S}$ is called $x^*$-tempered if

$$\sup_{i\in I} \int \mu(dx)\, \eta_i(x_i, x^*_i) < \infty.$$

In the sequel $x^* \in \mathbb{S}$ will be considered fixed and $\mu$ will be called tempered.

It is often the case in practice that the collection of metrics is uniformly bounded,
that is, $\sup_i \|\eta_i\| < \infty$. In this case, every probability measure on $\mathbb{S}$ is trivially
tempered. However, the restriction to tempered measures may be essential when
the spaces $\mathbb{S}^i$ are noncompact (see, for example, [18, section 5] for a simple but
illuminating example).

Let us recall that a norm $\|\cdot\|$ defined on an algebra of square (possibly infinite)
matrices is called a matrix norm if $\|AB\| \le \|A\|\,\|B\|$. We also recall that the matrix
norms $\|\cdot\|_\infty$ and $\|\cdot\|_1$ are defined for nonnegative matrices $A = (A_{ij})_{i,j\in I}$ as

$$\|A\|_\infty := \sup_{i\in I} \sum_{j\in I} A_{ij}, \qquad \|A\|_1 := \sup_{j\in I} \sum_{i\in I} A_{ij}.$$

The following result collects various useful alternatives to (6.1). It is proved in Section
C.4 in Appendix C.

Corollary 6.8 (Alternatives to assumption (6.1)). Suppose that $\rho$ and $\tilde\rho$ are tempered.
Then the conclusion of Theorem 6.4 remains valid when the assumption (6.1)
is replaced by one of the following:

1. $\mathrm{card}\, I < \infty$ and $D < \infty$.

2. $\mathrm{card}\, I < \infty$, $R < \infty$, and $\|(W^{-1}R)^n\| < 1$ for some matrix norm $\|\cdot\|$ and
$n \ge 1$.

3. $\sup_i W_{ii} < \infty$ and $\|W^{-1}R\|_\infty < 1$.

4. $\sup_i W_{ii} < \infty$, $\|RW^{-1}\|_\infty < \infty$, and $\|(RW^{-1})^n\|_\infty < 1$ for some $n \ge 1$.

5. $\sup_i W_{ii} < \infty$, $\sum_i \|\eta_i\| < \infty$, and $\|RW^{-1}\|_1 < 1$.

6. $\sup_i W_{ii} < \infty$, there exists a metric $m$ on $I$ such that $\sup\{m(i,j) : R_{ij} > 0\} < \infty$
and $\sup_i \sum_j e^{-\beta m(i,j)} < \infty$ for all $\beta > 0$, and $\|RW^{-1}\|_1 < 1$.
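
For a finite index set the two norms above are simply the largest row sum and the largest column sum, so the contraction requirements in Conditions 3 and 5 are easy to check numerically. The following Python sketch is only an illustration; the helper names and the example matrices are hypothetical, and the remaining hypotheses (temperedness, and $\sum_i \|\eta_i\| < \infty$ in Condition 5) are assumed to hold separately.

```python
def norm_inf(A):
    """sup_i sum_j A[i][j]: largest row sum of a nonnegative matrix."""
    return max(sum(row) for row in A)


def norm_one(A):
    """sup_j sum_i A[i][j]: largest column sum of a nonnegative matrix."""
    return max(sum(col) for col in zip(*A))


def influence_norms(W_diag, R):
    """Return (||W^{-1} R||_inf, ||R W^{-1}||_1) for diagonal W."""
    n = len(W_diag)
    # (W^{-1} R)_{ij} = R_ij / W_ii   (rows are rescaled)
    WinvR = [[R[i][j] / W_diag[i] for j in range(n)] for i in range(n)]
    # (R W^{-1})_{ij} = R_ij / W_jj   (columns are rescaled)
    RWinv = [[R[i][j] / W_diag[j] for j in range(n)] for i in range(n)]
    return norm_inf(WinvR), norm_one(RWinv)


on_a_site, of_a_site = influence_norms(
    W_diag=[2.0, 1.0],
    R=[[0.5, 0.5], [0.2, 0.1]],
)
print(on_a_site < 1, of_a_site < 1)
```

The first quantity corresponds to the "influence on a site" requirement of Condition 3 and the second to the "influence of a site" requirement of Conditions 5 and 6.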

The conditions of Corollary 6.8 are closely related to the uniqueness problem for
Gibbs measures. Suppose that the collection of quasilocal transition kernels $(\gamma^J)_{J\in\mathcal{J}}$
is a local update rule for $\rho$. It is natural to ask whether $\rho$ is the unique measure that
admits $(\gamma^J)_{J\in\mathcal{J}}$ as a local update rule (see the remark at the end of Section 6.2). We
now observe that uniqueness is a necessary condition for the conclusion of Theorem
6.4. Indeed, let $\tilde\rho$ be another measure that admits the same local update rule. If (6.1)
holds, we can apply Theorem 6.4 with $\tilde\gamma^J = \gamma^J$ and $a_j = 0$ to conclude that $\rho = \tilde\rho$.
In particular, $\sum_j (I - W + R)^n_{ij} \to 0$ in Theorem 6.4 evidently implies uniqueness in
the class of tempered measures.

Of  course,  the  point  of  Theorem  6.4  is  that  it  provides  a  quantitative  tool  that 
goes  far  beyond  qualitative  uniqueness  questions.  It  is  therefore  interesting  to  note 
that  this  single  result  nonetheless  captures  many  of  the  uniqueness  conditions  that 
are  used  in  the  literature.  In  Corollary  6.8,  Condition  3  is  precisely  the  “influence 
on  a  site”  condition  of  Weitz  [64,  Theorem  2.5]  (our  setting  is  even  more  general  in 
that  we  do  not  require  bounded-range  interactions  as  is  essential  in  [64]).  Conditions 
5  and  6  constitute  a  slight  strengthening  (see  below)  of  the  “influence  of  a  site” 
condition  of  Weitz  [64,  Theorem  2.7]  under  summable  metric  or  subexponential  graph 
assumptions,  in  the  spirit  of  the  classical  uniqueness  condition  of  Dobrushin  and 
Shlosman  [17].  In  the  finite  setting  with  single  site  updates,  Condition  2  is  in  the 
spirit  of  [  ]  and  Condition  4  is  in  the  spirit  of  [  ]. 

On the other hand, we can now see that Theorem 6.4 provides a crucial improvement
over the classical comparison theorem. The single site setting of Corollary 6.6
corresponds essentially to the original Dobrushin uniqueness regime [ ]. It is well
known that this setting is restrictive, in that it captures only a small part of the
parameter space where uniqueness of Gibbs measures holds. It is precisely for this
reason that Dobrushin and Shlosman introduced their improved uniqueness criterion
in terms of larger blocks [17], which in many cases allows one to capture a large part of
or even the entire uniqueness region; see [64, section 5] for examples. The generalized
comparison Theorem 6.4 in terms of larger blocks can therefore be fruitfully applied
to a much larger and more natural class of models than the classical comparison theorem.
This point is further emphasized in the context of the application to the block
particle filter in Section 6.4.

Remark 6.9. The “influence of a site” condition $\|RW^{-1}\|_1 < 1$ that appears in
Corollary 6.8 is slightly stronger than the corresponding condition of Dobrushin-Shlosman
[17] and Weitz [64, Theorem 2.7]. Writing out the definition of $R$, we
find that our condition reads

$$\|RW^{-1}\|_1 = \sup_{j\in I} W_{jj}^{-1} \sum_{i\in I} \sup_{\substack{x,z\in\mathbb{S}: \\ x^{I\setminus\{j\}} = z^{I\setminus\{j\}}}} \frac{1}{\eta_j(x_j, z_j)} \sum_{J\in\mathcal{J}:\, i\in J} w_J\, Q^J_{x,z}\eta_i < 1,$$

while the condition of [64, Theorem 2.7] (which extends the condition of [17]) reads

$$\sup_{j\in I} W_{jj}^{-1} \sup_{\substack{x,z\in\mathbb{S}: \\ x^{I\setminus\{j\}} = z^{I\setminus\{j\}}}} \frac{1}{\eta_j(x_j, z_j)} \sum_{i\in I} \sum_{J\in\mathcal{J}:\, i\in J} w_J\, Q^J_{x,z}\eta_i < 1.$$

The latter is slightly weaker as the sum over sites $i$ appears inside the supremum
over configurations $x, z$. While the distinction between these conditions is inessential
in many applications, there do exist situations in which the weaker condition yields
an essential improvement, see, e.g., [64, section 5.3]. In such problems, Theorem
6.4 is not only limited by the stronger uniqueness condition but could also lead to
poor quantitative bounds, as the comparison bound is itself expressed in terms of the
uniform influence coefficients $R_{ij}$.
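
The gap between the two conditions is the elementary inequality $\sup_c \sum_i u_{c,i} \le \sum_i \sup_c u_{c,i}$. A toy Python illustration follows, with made-up numbers: the table `influence` is hypothetical, and its entry `influence[c][i]` plays the role of the normalized influence on site $i$ for the $c$-th pair of configurations $x, z$ that differ only at a fixed site $j$.

```python
def sum_of_sups(table):
    """sum_i sup_c table[c][i]: the supremum is taken site by site."""
    return sum(max(col) for col in zip(*table))


def sup_of_sums(table):
    """sup_c sum_i table[c][i]: the sum sits inside the supremum."""
    return max(sum(row) for row in table)


# Each configuration pair strongly influences a different site.
influence = [
    [0.6, 0.1],
    [0.1, 0.6],
]
print(sum_of_sups(influence) < 1)  # False: the stronger condition fails
print(sup_of_sums(influence) < 1)  # True: the weaker condition holds
```

The stronger condition fails here because the worst case for each site is attained at a different pair of configurations, which is exactly the structure that a uniform coefficient $R_{ij}$ cannot see.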

It  could  therefore  be  of  interest  to  develop  comparison  theorems  that  are  able  to 
exploit  the  finer  structure  that  is  present  in  the  weaker  uniqueness  condition.  In 
fact,  the  proof  of  Theorem  6.4  already  indicates  a  natural  approach  to  such  improved 
bounds.  However,  the  resulting  comparison  theorems  are  necessarily  nonlinear  in  that 
the  action  of  the  matrix  R  is  replaced  by  a  nonlinear  operator  R.  The  nonlinear 
expressions  are  somewhat  difficult  to  handle  in  practice,  and  as  we  do  not  at  present 
have  a  compelling  application  for  such  bounds  we  do  not  pursue  this  direction  here. 
However,  for  completeness,  we  will  briefly  sketch  at  the  end  of  Section  C.2  how  such 
bounds  can  be  obtained. 

6.3.3  A  one-sided  comparison  theorem 

As was discussed in Section 6.2, it is natural in many applications to describe high-dimensional
probability distributions in terms of local conditional probabilities of the
form $\rho(X^J \in dz^J \mid X^{I\setminus J} = x^{I\setminus J})$. This is in essence a static picture, where we describe
the behavior of each local region $J$ given that the configuration of the remaining sites
$I \setminus J$ is frozen. In models that possess dynamics, this description is not very natural.
In this setting, each site $i \in I$ occurs at a given time $\tau(i)$, and its state is only
determined by the configuration of sites $j \in I$ in the past and present $\tau(j) \le \tau(i)$,
but not by the future. For example, the model might be defined as a high-dimensional
Markov chain whose description is naturally given in terms of one-sided conditional
probabilities (see, e.g., [23]). It is therefore interesting to note that the original
comparison theorem of Dobrushin [18] is actually more general than Corollary 6.6
in that it is applicable both in the static and dynamic settings (see the one-sided
Dobrushin comparison theorem, Theorem 2.12). We presently develop an analogous
generalization of Theorem 6.4.

For the purposes of this section, we assume that we are given a function $\tau : I \to \mathbb{Z}$
that assigns to each site $i \in I$ an integer index $\tau(i)$. We define

$$I_{\le k} := \{i \in I : \tau(i) \le k\}, \qquad \mathbb{S}_{\le k} := \mathbb{S}^{I_{\le k}},$$

and for any probability measure $\rho$ on $\mathbb{S}$ we denote by $\rho_{\le k}$ the marginal distribution
on $\mathbb{S}_{\le k}$.

Definition 6.10. A one-sided local update rule for $\rho$ is a collection $(\gamma^J)_{J\in\mathcal{J}}$ where

1. $\mathcal{J}$ is a cover of $I$ such that $\min_{i\in J} \tau(i) = \max_{i\in J} \tau(i) =: \tau(J)$ for every $J \in \mathcal{J}$.

2. $\gamma^J$ is a transition kernel from $\mathbb{S}_{\le\tau(J)}$ to $\mathbb{S}^J$.

3. $\rho_{\le\tau(J)}$ is $\gamma^J$-invariant for every $J \in \mathcal{J}$.

The canonical example of a one-sided local update rule is to consider the one-sided
conditional distributions $\gamma^J_x(dz^J) = \rho(X^J \in dz^J \mid X^{I_{\le\tau(J)}\setminus J} = x^{I_{\le\tau(J)}\setminus J})$. This situation
is particularly useful in the investigation of interacting Markov chains, cf. [18, 23],
where $\tau(j)$ denotes the time index of the site $j$ and we condition only on the past
and present, but not on the future.

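Definition 6.11. A one-sided coupled update rule for $(\rho, \tilde\rho)$ is a collection of transition
kernels $(\gamma^J, \tilde\gamma^J, Q^J, \hat Q^J)_{J\in\mathcal{J}}$ such that the following hold:

1. $(\gamma^J)_{J\in\mathcal{J}}$ and $(\tilde\gamma^J)_{J\in\mathcal{J}}$ are one-sided local update rules for $\rho$ and $\tilde\rho$, respectively.

2. $Q^J_{x,z}$ is a coupling of $\gamma^J_x, \gamma^J_z$ for $J \in \mathcal{J}$ and $x, z \in \mathbb{S}_{\le\tau(J)}$ with $\mathrm{card}\{i : x_i \ne z_i\} = 1$.

3. $\hat Q^J_x$ is a coupling of $\gamma^J_x, \tilde\gamma^J_x$ for $J \in \mathcal{J}$ and $x \in \mathbb{S}_{\le\tau(J)}$.

We can now state a one-sided counterpart to Theorem 6.4, which will be proved
in Section C.5.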

Theorem 6.12 (General comparison theorem, one-sided). Let $(\gamma^J, \tilde\gamma^J, Q^J, \hat Q^J)_{J\in\mathcal{J}}$
be a one-sided coupled update rule for $(\rho, \tilde\rho)$, and let $(w_J)_{J\in\mathcal{J}}$ be a family of strictly
positive weights. Define the matrices $W$ and $R$ and the vector $a$ as in Theorem 6.4.
Assume that $\gamma^J$ is quasilocal for every $J \in \mathcal{J}$, that

$$\sum_{j\in I} D_{ij}\, (\rho \otimes \tilde\rho)\eta_j < \infty \quad\text{for all } i \in I \quad\text{where}\quad D := \sum_{n=0}^{\infty} (W^{-1}R)^n, \tag{6.2}$$

and that (6.1) holds. Then we have

$$|\rho f - \tilde\rho f| \le \sum_{i,j\in I} \mathrm{osc}_i f\, D_{ij}\, W_{jj}^{-1} a_j$$

for any bounded and measurable quasilocal function $f$ such that $\mathrm{osc}_i f < \infty$ for all
$i \in I$.

Let us remark that the result of Theorem 6.12 is formally the same as that of
Theorem 6.4, except that we have changed the nature of the update rules used in the
definition of the coefficients. We also require a further assumption (6.2) in addition to
assumption (6.1) of Theorem 6.4, but this is not restrictive in practice: in particular,
it is readily verified that the conclusion of Theorem 6.12 also holds under any of the
conditions of Corollary 6.8.


6.4  Application:  block  particle  filter 


Our original motivation for developing the generalized comparison theorems of this
chapter was the investigation of algorithms for filtering in high dimension. In this
section we state a result that qualitatively improves Theorem 4.2, the main result of
Chapter 4, on the analysis of the block particle filter.

We work in the same setup as in Chapter 4, and we refer to Section 4.4.2
therein for a discussion that motivates the importance of the following theorem. The
proof of this result, which relies crucially on the generalized comparison theorems
developed in this chapter, is provided in Appendix C, Section C.6.

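Theorem 6.13 (Block particle filter, improved version of Theorem 4.2). For any
$0 < \delta < 1$ there exists $0 < \varepsilon_0 < 1$, depending only on $\delta$ and $\Delta$, such that the following
holds. Suppose there exist $\varepsilon_0 < \varepsilon < 1$ and $0 < \kappa < 1$ so that

$$\varepsilon\, q^v(x^v, z^v) \le p^v(x, z^v) \le \varepsilon^{-1} q^v(x^v, z^v),$$
$$\delta \le q^v(x^v, z^v) \le \delta^{-1},$$
$$\kappa \le g^v(x^v, y^v) \le \kappa^{-1}$$

for every $v \in V$, $x, z \in \mathbb{X}$, $y \in \mathbb{Y}$, where $q^v : \mathbb{X}^v \times \mathbb{X}^v \to \mathbb{R}_+$ is a transition density
with respect to $\psi^v$. Then for every $n \ge 0$, $\sigma \in \mathbb{X}$, $K \in \mathcal{K}$ and $J \subseteq K$ we have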


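$$|||\pi^\sigma_n - \hat\pi^\sigma_n|||_J \le \alpha\, \mathrm{card}\, J \left[ e^{-\beta_1 d(J, \partial K)} + \frac{e^{\beta_2 |\mathcal{K}|_\infty}}{N^\gamma} \right],$$

where $0 < \gamma \le \frac{1}{2}$ and $0 < \alpha, \beta_1, \beta_2 < \infty$ depend only on $\delta$, $\varepsilon$, $\kappa$, $r$, $\Delta$, and $\Delta_{\mathcal{K}}$.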


In Theorem 6.13, the parameter $\varepsilon$ controls the spatial correlations while the parameter
$\delta$ controls the temporal correlations (in contrast to Theorem 4.2, where both
are controlled simultaneously by $\varepsilon$). The key point is that $\delta$ can be arbitrary, and only
$\varepsilon$ must lie above the threshold $\varepsilon_0$. That the threshold $\varepsilon_0$ depends on $\delta$ is natural: the
more ergodic the dynamics, the more spatial interactions can be tolerated without
losing decay of correlations.

The proof of Theorem 4.2 was based on repeated application of the classical Dobrushin
comparison theorem (Corollary 6.6). While there are some significant differences
between the details of the proofs, the essential improvement that makes it
possible to prove Theorem 6.13 is that we can now exploit the generalized comparison
theorem (Theorem 6.4), which enables us to treat the spatial and temporal degrees
of freedom on a different footing (see Section C.6).

Chapter 7

Nonlinear filtering in infinite dimension


This  chapter  is  devoted  to  showing  that  filtering  in  infinite  dimension  is  qualitatively 
different  from  filtering  in  finite  dimension.  We  show  that  new  phenomena  arise  in  the 
infinite-dimensional  setting,  specifically  that  inheritance  of  ergodicity  (in  the  form  of 
stability  or  decay  of  correlations)  can  undergo  a  phase  transition  in  the  signal-to-noise 
ratio.  The  qualitative  setting  of  this  chapter  is  complementary  to  the  quantitative 
framework previously considered in this thesis. The material presented here is taken
from  the  paper  [42],  which  further  develops  this  set  of  ideas  by  providing  conditions 
to  guarantee  inheritance  of  ergodicity. 


7.1  Motivations 

In Chapter 4 and Chapter 5 we have shown that local filtering algorithms can attain
dimension-free approximation errors in high-dimensional models that exhibit
conditional decay of correlations. The natural tool to capture and exploit decay of
correlations is given by the Dobrushin comparison theorem, and in Chapter 6 we
extended this machinery by introducing more general comparison theorems.

The framework developed in the previous chapters is complementary in nature
to the one developed in the present chapter: the former provides quantitative estimates
under strong (‘high-temperature’) assumptions, while the latter focuses on the
qualitative understanding of ergodic properties of the filter distribution. In fact, as
discussed in Section 4.4.1, the local analysis of filtering algorithms that we have developed
relies on the crucial assumption that we can establish proper forms of filter
stability and decay of correlations. Presently, we address the fundamental question
of the inheritance of such properties upon conditioning.

To discuss the topic of this chapter, let $(X_k, Y_k)_{k\ge 0}$ be a bivariate Markov chain
of the kind considered in this thesis. Such a model represents the setting of partial
information: it is presumed that only $(Y_k)_{k\ge 0}$ can be observed, while $(X_k)_{k\ge 0}$ defines
the unobserved dynamics. In order to understand the behavior of the unobserved
process given the observations, it is natural to “lift” the unobserved dynamics to the
level of conditional distributions, that is, to investigate the nonlinear filter

$$\pi_k := \mathbf{P}(X_k \in \,\cdot \mid Y_1, \ldots, Y_k).$$

Under standard assumptions on the observation structure, the process $(\pi_k)_{k\ge 0}$ is itself
a measure-valued Markov chain. The fundamental question that arises in this setting
is to understand in what manner the probabilistic structure of the model $(X_k, Y_k)_{k\ge 0}$
“lifts” to the conditional distributions $(\pi_k)_{k\ge 0}$.

Of particular interest in this context is the behavior of ergodic properties under
conditioning. It is natural to suppose that the ergodic properties of $(X_k, Y_k)_{k\ge 0}$ will
be inherited by the filter $(\pi_k)_{k\ge 0}$: for example, if $X_k$ forgets its initial condition
as $k \to \infty$, then the optimal mean-square estimate of $X_k$ (and therefore the filter
$\pi_k$) should intuitively possess the same property. Such a conclusion was already
conjectured by Blackwell as early as 1957 [5], and a proof was provided by Kunita
in 1971 [33]. Unfortunately, both the proof and the conclusion are erroneous: it is
elementary to construct a finite-state Markov chain $(X_k, Y_k)_{k\ge 0}$ that is 1-dependent
(as strong an ergodic property as one could hope for) with observations of the form
$Y_k = h(X_{k-1}, X_k)$ such that the corresponding filtering process $(\pi_k)_{k\ge 0}$ is nonergodic,
see Example 7.1 below.¹

Despite  the  appearance  of  counterexamples  already  in  the  most  elementary  setting, 
recent  advances  have  provided  a  surprisingly  complete  picture  of  such  problems  in  a 
general  setting.  On  the  one  hand,  it  has  been  shown  under  very  general  assumptions 
[57, 52] that ergodicity of the underlying model is inherited by the filter when the
observations are nondegenerate, that is, when the conditional law of each observation
$\mathbf{P}(Y_k \in \,\cdot \mid X)$ has a positive density with respect to some fixed reference measure.
This  is  a  mild  condition  in  classical  filtering  models  that  serves  mainly  to  rule  out  the 
singular  case  of  noiseless  observations:  for  example,  the  addition  of  any  observation 
noise  to  the  above  counterexample  would  render  the  filter  ergodic.  On  the  other  hand, 
even  in  the  noiseless  case,  ergodicity  is  inherited  in  the  absence  of  certain  symmetries 
that  are  closely  related  to  systems-theoretic  notions  of  observability  [54,  56,  58,  9]. 
One  can  therefore  conclude  that  while  there  exist  elementary  examples  where  the 
ergodicity  of  the  model  fails  to  be  inherited  by  the  filter,  such  examples  must  be 
very  fragile  as  they  require  both  a  singular  observation  structure  and  the  presence  of 
unusual  symmetries,  either  of  which  is  readily  broken  by  a  small  perturbation  of  the 
model. 

The theory outlined above provides a satisfactory understanding of conditional
ergodicity in classical filtering models. Some care must be taken, however, in interpreting
this conclusion. The ubiquitous applicability of the theory hinges on the
notion that most filtering models possess observation densities, an assumption made
almost universally in the filtering literature (cf. [13] and the references therein).
This assumption is largely innocuous in finite-dimensional systems. The situation is
entirely different in infinite dimension, where singularity of probability measures is
the norm. There exists almost no mathematical literature on filtering in infinite dimension,
despite the substantial practical importance of infinite-dimensional filtering
models in data assimilation problems that arise in areas such as weather forecasting
or geophysics [19]. The aim of this chapter is to draw attention to the fact that,
far from being a technical issue, the infinite-dimensional setting gives rise to new
probabilistic phenomena and questions in filtering theory that are fundamentally
different than those that have been studied in the literature to date, and whose
understanding remains limited.

¹ Surprisingly, the counterexample (intended for a different purpose) appears in Blackwell’s own
paper [5].

To model a filtering problem in infinite dimension we extend the framework introduced
in Section 4.1. We now suppose that $(X_k, Y_k)_{k\ge 0}$ is a Markov chain in the
product state space $E^V \times F^V$, where $E, F$ are local state spaces and $V$ is a countably
infinite set of sites (for concreteness, we fix $V = \mathbb{Z}^d$ throughout). Each element of $V$
should be viewed as a single dimension of the model. A more practical interpretation
is that $V$ defines a spatial degree of freedom and that $(X_k, Y_k)_{k\ge 0}$ describes the dynamics
of a time-varying random field, as is the case in data assimilation applications.
In accordance with this interpretation, we will assume that the dynamics of the state
$X_k$ and the observations $Y_k$ are local in nature: that is, the conditional distributions
of the local state $X^v_k$ given the previous state $X_{k-1}$, and of the local observation $Y^v_k$
given the underlying process $X$, depend only on $X^w_{k-1}$ and $X^w_k$ for sites $w \in V$ that
are neighbors of $v$. In essence, our basic model therefore consists of an infinite family
of local filtering models $(X^v_k, Y^v_k)_{k\ge 0}$ whose dynamics are locally coupled according to
the graph structure of $V = \mathbb{Z}^d$.

In Section 7.2 we review the classical results on the inheritance of filter stability,
and we discuss Blackwell’s Example 7.1. In Section 7.3 we introduce the canonical
infinite-dimensional model that will be studied in this chapter, and in Section
7.4 we investigate the natural infinite-dimensional version of Blackwell’s Example.
Recall that it was crucial in the finite-dimensional setting that the observations
$Y_k = h(X_{k-1}, X_k)$ are noiseless: the addition of any noise renders the observations
nondegenerate and then ergodicity is preserved. This is no longer the case in infinite
dimension: even if the local observations $Y^v_k$ are nondegenerate, the failure of the
filter to inherit ergodicity can persist. In fact, we observe a phase transition: the
filter fails to be ergodic when the noise is small, but becomes ergodic when the noise
strength exceeds a strictly positive threshold. The remarkable feature of this phenomenon
is that no qualitative change of any kind occurs in the ergodic properties of
the underlying model: $(X^v_k, Y^v_k)_{k\ge 0,\, v\in V}$ is a 1-dependent random field for every value
of the noise parameter. We are therefore in the surprising situation that complex ergodic
behavior emerges in an otherwise trivial model when we consider its conditional
distributions. Such conditional phase transitions cannot arise in finite dimension.

The  above  example  indicates  that  our  intuition  about  inheritance  of  ergodicity, 
which  fails  in  classical  filtering  models  only  in  pathological  cases,  cannot  be  taken 
for  granted  in  infinite  dimension  even  under  local  nondegeneracy  assumptions.  This 
raises  the  question  as  to  whether  there  are  situations  in  which  the  inheritance  of 
ergodicity  is  guaranteed.  In  view  of  the  finite-dimensional  theory,  in  Section  7.5  we 
conjecture  that  this  might  be  the  case  under  a  symmetry  breaking  assumption.  We 
refer to the paper [42] for a more detailed discussion of this conjecture, and for some
positive results that go towards proving it.

In Section 7.6 we turn our attention to the counterpart of the filter stability
problem  in  the  setting  of  Markov  random  fields.  Such  problems  provide  a  simple 
setting  for  the  investigation  of  decay  of  correlations  in  filtering  problems,  and  are 
of  interest  in  their  own  right  as  models  that  arise,  for  example,  in  image  analysis 
[66,  26].  Here  the  natural  question  of  interest  is  whether  the  spatial  mixing  properties 
of random fields are inherited by conditioning on local observations. Again, we refer
to  the  paper  [42]  for  more  details  on  the  matter. 

7.2  Inheritance  of  ergodicity:  classical  results 

The  goal  of  this  section  is  to  set  up  the  basic  filtering  problem  that  will  be  studied  in 
the  sequel.  We  begin  by  defining  a  general  setting  for  nonlinear  filtering  that  slightly 
generalizes  the  one  introduced  in  Chapter  3,  and  we  introduce  and  discuss  the  basic 
ergodicity  question  to  be  studied. 

Throughout this chapter, we model dynamics with partial information as a hidden
Markov model, where $(X_k, Y_k)_{k\ge 0}$ is a Markov chain that has the additional property
that its transition kernel factorizes as

\[ \mathbf{P}((X_k,Y_k)\in A \mid X_{k-1},Y_{k-1}) = \int \mathbf{1}_A(x,y)\, P(X_{k-1},dx)\,\Phi(X_{k-1},x,dy) \]

for given transition kernels $P$ and $\Phi$: the factorization corresponds to the assumption
that $(X_k)_{k\ge 0}$ is a Markov chain in its own right, and that the observations $(Y_k)_{k\ge 0}$
are conditionally independent given $(X_k)_{k\ge 0}$. More general settings could also be
considered; see [51] for instance.

For  the  time  being,  we  assume  that  Xk  and  Yk  take  values  in  an  arbitrary  Polish 
space (we will define a more concrete infinite-dimensional setting in Section 7.3 below).
The nonlinear filter is defined as the regular conditional probability

\[ \pi_k := \mathbf{P}(X_k\in\,\cdot \mid Y_1,\ldots,Y_k). \]
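
When the state and observation spaces are finite, the filter can be computed by a simple forward recursion. The following is a minimal sketch under that assumption; the names `filter_step`, `run_filter` and `brute_force_filter`, and the array layout of the kernels, are illustrative choices and do not come from the text. The brute-force routine sums over all hidden paths and serves only as a check of the recursion.

```python
from itertools import product

def filter_step(pi_prev, P, Phi, y):
    """Map pi_{k-1} to pi_k after observing Y_k = y (finite spaces assumed).

    pi_prev[a]   : P(X_{k-1} = a | Y_1, ..., Y_{k-1})
    P[a][b]      : transition probability from state a to state b
    Phi[a][b][y] : probability of observing y given X_{k-1} = a and X_k = b
    """
    n = len(pi_prev)
    unnorm = [sum(pi_prev[a] * P[a][b] * Phi[a][b][y] for a in range(n))
              for b in range(n)]
    total = sum(unnorm)  # positive whenever the observed y has positive probability
    return [u / total for u in unnorm]

def run_filter(pi0, P, Phi, ys):
    """Apply the recursion along the observation sequence ys, starting from X_0 ~ pi0."""
    pi = list(pi0)
    for y in ys:
        pi = filter_step(pi, P, Phi, y)
    return pi

def brute_force_filter(pi0, P, Phi, ys):
    """Same conditional law, computed by summing over every hidden path (check only)."""
    n, k = len(pi0), len(ys)
    mass = [0.0] * n
    for path in product(range(n), repeat=k + 1):
        w = pi0[path[0]]
        for j in range(1, k + 1):
            w *= P[path[j - 1]][path[j]] * Phi[path[j - 1]][path[j]][ys[j - 1]]
        mass[path[k]] += w
    total = sum(mass)
    return [m / total for m in mass]
```

Taking `pi0` to be the invariant measure gives the stationary filter considered in this chapter, while a point mass for `pi0` gives the filter conditioned on the initial state.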

We are interested in the question of whether $(\pi_k)_{k\ge 0}$ inherits the ergodic properties of
the underlying dynamics $(X_k)_{k\ge 0}$. There are several different but closely connected
ways  to  make  this  question  precise  (cf.  Remark  7.3  below).  For  concreteness,  we  will 
focus  attention  on  one  particularly  elementary  formulation  of  this  question  that  will 
serve  as  the  guiding  problem  to  be  investigated  throughout  this  chapter. 

We will assume in the sequel that the Markov chain $(X_k)_{k\ge 0}$ admits a unique
invariant measure $\lambda$. As $\mathbf{P}((X_k,Y_k)\in\,\cdot \mid X_{k-1},Y_{k-1})$ does not depend on $Y_{k-1}$ due to
the hidden Markov structure, the invariant measure $\lambda$ extends uniquely to an invariant
measure for the chain $(X_k,Y_k)_{k\ge 0}$, and we denote the unique stationary law of this
process as $\mathbf{P}$. By stationarity, we can assume in the sequel that $(X_k,Y_k)_{k\in\mathbb{Z}}$ is defined
also for $k<0$.

Throughout this chapter, the ergodic property of $(X_k)_{k\ge 0}$ that we will consider is
stability in the sense that

\[ |\mathbf{P}(X_k\in A\mid X_0) - \lambda(A)| \xrightarrow{k\to\infty} 0 \quad\text{in } L^1 \]

for every measurable set $A$: that is, the law of $X_k$ “forgets” the initial condition $X_0$
as $k\to\infty$. The analogous conditional property is filter stability in the sense that

\[ |\mathbf{P}(X_k\in A\mid X_0,Y_1,\ldots,Y_k) - \mathbf{P}(X_k\in A\mid Y_1,\ldots,Y_k)| \xrightarrow{k\to\infty} 0 \quad\text{in } L^1 \]

for every measurable set $A$: that is, the conditional distribution of $X_k$ given the observed
data “forgets” the initial condition $X_0$ as $k\to\infty$. It is natural to suppose that
stability of the underlying dynamics will imply stability of the filter. This conclusion
is incorrect, however, as is illustrated by the following classical example [ ].

Example 7.1 (Blackwell’s example). Let $(X_k)_{k\ge 0}$ be an i.i.d. sequence of random
variables with $\mathbf{P}(X_k = 1) = \mathbf{P}(X_k = -1) = 1/2$, and let $Y_k = X_kX_{k-1}$ for $k\ge 1$.
This evidently defines a stationary hidden Markov model with $P(x,\cdot) = (\delta_1 + \delta_{-1})/2$
and $\Phi(x',x,\cdot) = \delta_{xx'}$. Note that

\[ X_k = X_0\,Y_1Y_2\cdots Y_k. \]

We can therefore easily compute for every $k\ge 1$

\[ \mathbf{P}(X_k = 1\mid X_0,Y_1,\ldots,Y_k) = \mathbf{1}_{X_k=1}, \qquad \mathbf{P}(X_k = 1\mid Y_1,\ldots,Y_k) = 1/2. \]

Thus the filter is certainly not stable. On the other hand, the underlying dynamics $(X_k)_{k\ge 0}$
is an i.i.d. sequence, and is therefore stable in the strongest possible sense:

\[ \mathbf{P}(X_k\in A\mid X_0) = \lambda(A) \quad\text{for all } k\ge 1. \]

Moreover, even the process $(X_k,Y_k)_{k\ge 0}$ is stable in the strongest possible sense: it is
a 1-dependent sequence, so that $\mathbf{P}((X_k,Y_k)\in A\mid X_0,Y_0) = \mathbf{P}((X_k,Y_k)\in A)$ for all
$k\ge 2$.
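
The two conditional probabilities in Blackwell's example can be confirmed by exhaustive enumeration for small $k$. The sketch below uses exact rational arithmetic; the function name `blackwell_conditionals` is a hypothetical helper introduced here for illustration.

```python
from itertools import product
from collections import defaultdict
from fractions import Fraction

def blackwell_conditionals(k):
    """Exact conditional probabilities of {X_k = 1} in Blackwell's example.

    Returns two dictionaries:
      given_y[y]          = P(X_k = 1 | Y_1..Y_k = y)
      given_x0y[(x0, y)]  = P(X_k = 1 | X_0 = x0, Y_1..Y_k = y)
    """
    weight = Fraction(1, 2 ** (k + 1))  # every path (x_0, ..., x_k) is equally likely
    joint_y = defaultdict(lambda: [Fraction(0), Fraction(0)])    # [P(y), P(y, X_k = 1)]
    joint_x0y = defaultdict(lambda: [Fraction(0), Fraction(0)])  # same, keyed by (x_0, y)
    for x in product((-1, 1), repeat=k + 1):
        y = tuple(x[j] * x[j - 1] for j in range(1, k + 1))  # Y_j = X_j X_{j-1}
        for table, key in ((joint_y, y), (joint_x0y, (x[0], y))):
            table[key][0] += weight
            if x[k] == 1:
                table[key][1] += weight
    given_y = {key: hit / tot for key, (tot, hit) in joint_y.items()}
    given_x0y = {key: hit / tot for key, (tot, hit) in joint_x0y.items()}
    return given_y, given_x0y
```

Without $X_0$ every conditional probability equals $1/2$, while with $X_0$ it is the indicator of the event $X_0Y_1\cdots Y_k = 1$.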

Example  7.1  shows  that  the  inheritance  of  ergodicity  under  conditioning  cannot 
be  taken  for  granted.  Nonetheless,  the  phenomenon  exhibited  here  is  very  fragile:  if 
the observations are perturbed by any noise (for example, if we set $Y_k = X_kX_{k-1}\xi_k$
with $\mathbf{P}(\xi_k = -1) = 1 - \mathbf{P}(\xi_k = 1) = p$ and any $0 < p < 1$), the filter will become
stable. The inheritance of ergodicity is therefore apparently obstructed by the singularity
of the observation kernel $\Phi$. To rule out such singular behavior, it is natural
to require that the observation kernel $\Phi$ possesses a positive density with respect to
some reference measure $\varphi$. A model with this property is said to possess nondegenerate
observations. One might now expect that nondegeneracy of the observations
removes the obstruction to inheritance of ergodicity observed in Example 7.1. Unfortunately,
this is still not the case in complete generality, as is demonstrated by
an esoteric counterexample in [59]. However, the conclusion does hold if we use a
stronger uniform notion of stability.
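
For the noisy version of Example 7.1, the effect of the noise can be made explicit by a short computation. The notation $\sigma_k(x) := \mathbf{P}(X_k = x\mid X_0,Y_1,\ldots,Y_k)$, $m_k := \sigma_k(1)-\sigma_k(-1)$ and $g$ is introduced here for illustration only and is not used elsewhere in the text.

```latex
% Noisy Blackwell example: Y_k = X_k X_{k-1} \xi_k with P(\xi_k = -1) = p.
\begin{align*}
g(x',x,y) &:= \mathbf{P}(Y_k = y \mid X_{k-1}=x',\, X_k = x)
           = \tfrac12 + \bigl(\tfrac12 - p\bigr)\,y\,x\,x', \\
\sigma_k(x) &= \frac{\sum_{x'} \sigma_{k-1}(x')\,\tfrac12\, g(x',x,Y_k)}
                    {\sum_{\tilde x,\,x'} \sigma_{k-1}(x')\,\tfrac12\, g(x',\tilde x,Y_k)}
            = \tfrac12 + \bigl(\tfrac12 - p\bigr)\,Y_k\,x\,m_{k-1}, \\
m_k &= (1-2p)\,Y_k\,m_{k-1} = (1-2p)^k\,X_0\,Y_1\cdots Y_k .
\end{align*}
```

The same recursion started from the uniform law of $X_0$ has $m_0 = 0$, so the stationary filter assigns probability $1/2$ to each state for every $k$. The two filters therefore differ by $\tfrac12|1-2p|^k$ on the event $\{X_k = 1\}$, which tends to zero exactly when $0<p<1$ and stays equal to $1/2$ when $p=0$, in agreement with Example 7.1.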

Theorem  7.2  (Inheritance  of  stability  [5  ]).  Suppose  that  the  following  hold. 
1. The underlying dynamics is uniformly stable in the sense

\[ \sup_A |\mathbf{P}(X_k\in A\mid X_0) - \lambda(A)| \xrightarrow{k\to\infty} 0 \quad\text{in } L^1. \]

2. The observations are nondegenerate in the sense

\[ \Phi(x',x,dy) = g(x',x,y)\,\varphi(dy), \qquad g(x',x,y) > 0 \quad\text{for all } x,x',y. \]

Then the filter is uniformly stable in the sense

\[ \sup_A |\mathbf{P}(X_k\in A\mid X_0,Y_1,\ldots,Y_k) - \mathbf{P}(X_k\in A\mid Y_1,\ldots,Y_k)| \xrightarrow{k\to\infty} 0 \quad\text{in } L^1. \]

This result, together with the mathematical theory behind its proof, provides a very
general  qualitative  understanding  of  the  inheritance  of  ergodicity  in  classical  filtering 
models.  However,  as  will  be  explained  below,  this  theory  breaks  down  completely  in 
infinite-dimensional  models.  In  the  remainder  of  this  chapter,  we  will  see  that  new 
phenomena  arise  in  the  infinite-dimensional  setting. 

Remark  7.3  (Different  formulations  of  filter  stability).  The  question  of  inheritance 
of  ergodic  properties  under  conditioning  can  be  formulated  in  a  number  of  different 
ways.  For  concreteness,  we  focus  our  attention  in  this  chapter  on  the  elementary 
formulation  introduced  above.  As  the  choice  of  problem  is  somewhat  arbitrary,  let  us 
briefly  describe  a  number  of  alternative  formulations. 

In the setting of stability of the filter, we have considered “forgetting” of the initial
condition $X_0$ under the stationary measure. Similar problems can be formulated,
however, in a more general setting. Denote by $\mathbf{P}^\mu$ the law of the process $(X_k,Y_k)_{k\ge 0}$
with the initial distribution $X_0\sim\mu$. A natural notion of stability is to require that

\[ \mathbf{P}^\mu(X_k\in\,\cdot\,) \xrightarrow{k\to\infty} \lambda \quad\text{for every } \mu \]

in a suitable topology on probability measures. If we define the filter started at $\mu$
as $\pi_k^\mu := \mathbf{P}^\mu(X_k\in\,\cdot \mid Y_1,\ldots,Y_k)$, we can now investigate the general filter stability
problem

\[ |\pi_k^\mu(f) - \pi_k^\nu(f)| \xrightarrow{k\to\infty} 0 \quad\text{in } L^1(\mathbf{P}^\gamma) \]

for a suitable class of measures $\mu,\nu,\gamma$ and functions $f$. The formulation that we
consider in this chapter corresponds to the special case $\nu = \lambda$ and $\mu = \gamma = \delta_x$ for $x$
outside a $\lambda$-null set. Nonetheless, our formulation proves to be equivalent in a rather
general setting to stability for general initial measures $\mu,\nu,\gamma$, cf. [13, Chapter 12] and
[57, 52].

A different and perhaps more natural formulation dates back to Blackwell [5] and
Kunita [33]. Using the Markov property of the underlying model, it is not difficult
to show that the measure-valued stochastic process $(\pi_k)_{k\ge 0}$ is itself a Markov chain,
cf. [59, Appendix A]. One can now ask whether the ergodic properties of the Markov
chain $(X_k)_{k\ge 0}$ “lift” to ergodic properties of the Markov chain $(\pi_k)_{k\ge 0}$. For example, if
$(X_k)_{k\ge 0}$ admits a unique stationary measure, does $(\pi_k)_{k\ge 0}$ admit a unique stationary
measure also? Similarly, if $(X_k)_{k\ge 0}$ converges to its stationary measure starting from
any initial condition, does the same property hold for $(\pi_k)_{k\ge 0}$? Remarkably, while
these questions appear at first sight to be quite distinct from the question of filter
stability, such properties again prove to be equivalent in a very general setting to the
notion  of  filter  stability  that  we  consider  in  this  chapter,  cf.  [33,  f8,  9,  13,  59]. 

A third formulation of inheritance of ergodicity under conditioning is obtained
when we consider, rather than the filter, the conditional distribution of the entire
process $X = (X_k)_{k\in\mathbb{Z}}$ given the infinite observation sequence $Y = (Y_k)_{k\in\mathbb{Z}}$. Using the
Markov property of the underlying model, it is not difficult to establish that $X$ is still a
Markov process under the conditional distribution $\mathbf{P}(\,\cdot \mid Y)$, albeit time-inhomogeneous
and with transition probabilities that depend on the realized observation sequence $Y$:
that is, the conditional process is a Markov chain in a random environment. One
can now ask whether the process $X$ inherits its ergodic properties under $\mathbf{P}$ when it
is considered under the conditional distribution $\mathbf{P}(\,\cdot \mid Y)$. Once again, this apparently
distinct formulation proves to be equivalent in a general setting to the formulation
considered in this chapter, a fact that is exploited heavily in the theory of [57, 52].

It is now well understood that the properties described above are equivalent in classical
filtering models. While some of these arguments extend directly to the infinite-dimensional
setting, others do not, and it remains to be investigated to what extent
these equivalences remain valid in infinite dimension. Nonetheless, the problem
formulation considered here is arguably the most elementary one, and provides a natural
starting point for the investigation of conditional phenomena in infinite dimension.

Remark 7.4 (On observability). Even when the underlying dynamics $(X_k)_{k\ge 0}$ is not
stable, it may be the case that the filter is stable. For example, using the trivial
observation model $Y_k = X_k$, the filter is stable regardless of any properties of the underlying
model. More generally, the filter is expected to be stable when the observations are
“sufficiently informative,” which is made precise in [54, 56, 58] in terms of nonlinear
notions of observability. Such results are in some sense the opposite of Theorem 7.2:
the latter shows that ergodicity is inherited by the filter, while the former show that
the filter can be ergodic regardless of ergodicity of the underlying model (even without
nondegeneracy). None of these results prove to be satisfactory in infinite dimension:
it appears that a general theory for ergodicity of the filter will require both ergodicity
of the underlying model and some form of observability, as will become evident in the
following sections.

7.3  The  infinite-dimensional  model 

The  aim  of  this  chapter  is  to  show  that  new  phenomena  arise  in  filtering  theory  in 
infinite  dimension.  So  far,  no  assumptions  have  been  made  on  the  model  dimension: 
we  have  set  up  our  theory  in  any  Polish  state  space.  Nonetheless,  while  no  explicit 
dimensionality  requirements  appear,  for  example,  in  Theorem  7.2,  the  assumptions 
of previous results can typically hold only in finite-dimensional situations. To understand
the problems that arise in infinite dimension, and to provide a concrete setting
for the investigation of conditional phenomena in infinite dimension, we presently introduce
a canonical infinite-dimensional filtering model that will be used in the sequel
(this model represents a generalization of the finite-dimensional model considered in
Section 4.1).

The practical interest in infinite-dimensional filtering models stems from problems
that have spatial in addition to dynamical structure. To model this situation, let us
assume for concreteness that the spatial degrees of freedom are indexed by the infinite
lattice $\mathbb{Z}^d$. We also define Polish spaces $E$ and $F$ that describe the state of the model
at each spatial location. We now assume that $X_k$ and $Y_k$ are random fields that are
indexed by $\mathbb{Z}^d$ and take values locally in $E$ and $F$, respectively, for every time $k$: that
is,

\[ X_k = (X_k^v)_{v\in\mathbb{Z}^d} \in E^{\mathbb{Z}^d} \quad\text{and}\quad Y_k = (Y_k^v)_{v\in\mathbb{Z}^d} \in F^{\mathbb{Z}^d}. \]

Each $v\in\mathbb{Z}^d$ should be viewed as a single “dimension” of the model.² We now define a
hidden Markov model that respects the spatial structure of the problem by assuming
that both the underlying dynamics and the observations are local: that is, we assume
that the transition and observation kernels $P$ and $\Phi$ factorize as

\[ P(x,dz) = \prod_{v\in\mathbb{Z}^d} P^v(x,dz^v), \qquad \Phi(x,z,dy) = \prod_{v\in\mathbb{Z}^d} \Phi^v(x,z,dy^v), \]

where

\[ P^v(x,A) \text{ and } \Phi^v(x,z,B) \text{ depend only on } x^w, z^w \text{ for } \|w-v\|\le 1. \]

Such a model should be viewed as a hidden Markov model counterpart of probabilistic
cellular automata [35] or interacting particle systems [36] that have been widely
investigated in the literature as natural models of space-time dynamics. Alternatively,
one might view such a model as an infinite collection $(X_k^v,Y_k^v)_{k\ge 0}$ of hidden Markov
models whose dynamics and observations are locally coupled to their neighbors in $\mathbb{Z}^d$.

While  problems  of  this  type  have  been  rarely  considered  in  filtering  theory,  the 
infinite-dimensional  model  that  we  have  formulated  is  in  principle  a  special  case  of 
the  general  model  described  in  the  previous  section.  However,  its  structure  is  such 
that  the  assumptions  of  a  result  such  as  Theorem  7.2  typically  cannot  hold.  Let 
us consider, for example, the setting where each local observation $Y_k^v$ has a positive
density of the form $\Phi^v(x,z,dy^v) = g(z^v,y^v)\,\varphi(dy^v)$, so that the observations are
locally nondegenerate. Choose two values $e,e'\in E$ such that $g(e,\cdot)\ne g(e',\cdot)$, and
define the constant configurations $z,z'$ as $z^v = e$ and $(z')^v = e'$ for all $v\in\mathbb{Z}^d$. Then
the measures $\Phi(x,z,\cdot)$ and $\Phi(x,z',\cdot)$ are two distinct laws of an infinite number of
i.i.d. random variables, and are therefore mutually singular (cf. Proposition 2.14).
This immediately rules out the possibility that the observations are nondegenerate
in the sense of Theorem 7.2. It is precisely this problem that lies at the heart of the
difficulties in infinite-dimensional models: probability measures in infinite dimension
are typically mutually singular, even when they admit densities locally (that is, for
any finite-dimensional marginal); see Section 2.4. In the absence of densities, classical
results in filtering theory cannot be taken for granted, and the study of filtering
in infinite dimension gives rise to fundamentally different problems than have been
studied in the literature to date. We initiate the investigation of such problems in
the sequel.

² The present setting is easily extended to the setting of more general locally finite graphs and
to the setting where each location $v$ may possess a different local state space $E^v$. Such an extension
does not illuminate significantly the phenomena that will be investigated in the sequel. On the
other hand, a nontrivial extension of substantial interest in applications is to continuous
infinite-dimensional models such as stochastic partial differential equations, cf. [49].
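
The mutual singularity of distinct infinite product measures can be seen emerging in finite dimension: the total variation distance between two $n$-fold product laws approaches one as $n$ grows. The sketch below assumes, purely for illustration, that the two local laws are Bernoulli with parameters `a` and `b`; the function names are hypothetical. The Hellinger affinity of a product is the product of the local affinities, which gives the standard bounds $1-\rho^n \le d_{\mathrm{TV}} \le \sqrt{1-\rho^{2n}}$.

```python
from math import comb, sqrt

def tv_product_bernoulli(a, b, n):
    """Exact total variation distance between Bernoulli(a)^n and Bernoulli(b)^n.

    The likelihood ratio depends on a configuration only through its number of
    ones j, so the sum over 2^n configurations collapses to a sum over j.
    """
    return 0.5 * sum(
        comb(n, j) * abs(a**j * (1 - a)**(n - j) - b**j * (1 - b)**(n - j))
        for j in range(n + 1)
    )

def hellinger_affinity(a, b, n):
    """Hellinger affinity rho^n of the two n-fold product measures."""
    rho = sqrt(a * b) + sqrt((1 - a) * (1 - b))  # affinity of the local laws
    return rho ** n
```

Since the local affinity is strictly less than one whenever `a != b`, the lower bound forces the distance to one in the limit, which is the finite-dimensional shadow of the singularity statement above.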

Remark 7.5 (Observations: the problem in infinite dimension). The singularity
of measures in infinite dimension is problematic not only for the nondegeneracy of
observations, but also for the ergodic theory of Markov chains. For example, the
uniform stability property in Theorem 7.2 will rarely hold in infinite dimension: it is
often the case that the law of $X_k$ is singular with respect to $\lambda$ for all $k<\infty$, which
rules out total variation convergence (see [52, Example 2.3] for a simple illustration).
However, this issue is surmounted in [52] using a form of localization: by performing
the analysis of Theorem 7.2 locally (that is, on finite-dimensional projections of the
original model), we can avoid the singularity of the full infinite-dimensional problem.
This allows us to extend the conclusion of Theorem 7.2 to a wide range of
infinite-dimensional models with nondegenerate observations. In practice, this implies that
much of the classical filtering theory extends, at least in spirit, to models where $X_k$
is infinite-dimensional but $Y_k$ is (effectively) finite-dimensional. It is only when the
observations $Y_k$ are also infinite-dimensional that new phenomena arise.

Remark 7.6 (On infinite-dimensional models). Let us note that we have used the
term “infinite-dimensional” to denote the situation where there are infinitely many
independent degrees of freedom, which is the key issue in our setting. The problem of
dimension is unrelated to the linear algebraic or metric dimension of the state space:
indeed, even each of the local state spaces $E$ and $F$ in our model can itself be an
arbitrary Polish space. Conversely, it is possible to have infinite-dimensional systems
that are “effectively finite-dimensional” in the sense that only finitely many degrees
of freedom carry significant information. This is common, for example, in stochastic
partial differential equations (see, e.g., [52]). See also Section 2.4.

At the same time, it should be noted that even in finite-dimensional systems where
results such as Theorem 7.2 technically apply, the qualitative information contained
in such statements may be misleading from the practical point of view: in finite but
high-dimensional systems, phenomena that arise qualitatively in infinite dimension
are still manifested in a quantitative fashion (see Chapter 4 for quantitative results
and discussion on filtering in high dimension). For example, if the filter is not stable
for the infinite-dimensional model, it will often still be the case that the filter is stable
for every finite-dimensional truncation of the model; however, the quantitative rate
of stability will vanish rapidly as the dimension is increased. Conversely, if the filter
is stable for the infinite-dimensional model, then the rate of stability of the filter for
the finite-dimensional models will be dimension-free. As it is ultimately the quantitative
behavior of filtering algorithms that is of importance in practice, the qualitative
phenomena investigated here in infinite dimension can still provide more insight into
the behavior of practical filtering problems in high dimension than classical results in
filtering theory.


7.4  A  conditional  phase  transition 

We  now  develop  a  simple  example  of  the  general  infinite-dimensional  setting  of  Section 
7.3  where  we  observe  nontrivial  behavior  of  the  inheritance  of  ergodicity.  This  model, 
to  be  described  presently,  is  a  natural  infinite-dimensional  variation  on  Blackwell’s 
counterexample  (Example  7.1  above). 

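Throughout this section,

\[ X_k = (X_k^v)_{v\in\mathbb{Z}} \in \{-1,1\}^{\mathbb{Z}} \quad\text{and}\quad Y_k = (Y_k^v,\bar Y_k^v)_{v\in\mathbb{Z}} \in (\{-1,1\}\times\{-1,1\})^{\mathbb{Z}} \]

are binary random fields in one spatial dimension. We let

\[ (X_k^v)_{k,v\in\mathbb{Z}} \text{ be i.i.d. with } \mathbf{P}(X_k^v = 1) = 1/2, \]

and we let

\[ Y_k^v = X_k^v X_{k-1}^v\,\xi_k^v, \qquad \bar Y_k^v = X_k^v X_k^{v+1}\,\bar\xi_k^v, \]

where

\[ (\xi_k^v)_{k,v\in\mathbb{Z}},\ (\bar\xi_k^v)_{k,v\in\mathbb{Z}} \text{ are i.i.d. with } \mathbf{P}(\xi_k^v = -1) = p \]

and $(\xi_k^v)_{k,v\in\mathbb{Z}}$, $(\bar\xi_k^v)_{k,v\in\mathbb{Z}}$ are independent of $(X_k^v)_{k,v\in\mathbb{Z}}$.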

This  evidently  corresponds  to  a  model  of  the  form  discussed  in  Section  7.3.  In 
words,  the  underlying  dynamics  is  of  the  simplest  possible  type:  each  time  and  each 
spatial location is an independent random variable. When $p = 0$, the observations
reveal  for  each  site  whether  its  current  state  differs  from  its  state  at  the  previous 
time  and  from  the  states  of  its  two  neighbors  at  the  present  time.  When  p  >  0,  each 
observation  is  subject  to  additional  noise  that  inverts  the  outcome  with  probability 
$p$. By symmetry, it will suffice to consider the case $p\le 1/2$, which we will do from
now  on. 
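
A finite truncation of this model is straightforward to simulate. The sketch below assumes that the two observation components are the products $Y_k^v = X_k^vX_{k-1}^v\xi_k^v$ and $\bar Y_k^v = X_k^vX_k^{v+1}\bar\xi_k^v$, which matches the verbal description above; the periodic boundary condition on a ring of `n_sites` sites and all function names are assumptions made only for illustration.

```python
import random

def sample_signs(rng, rows, cols, prob_minus):
    """rows x cols array of independent signs, each equal to -1 with probability prob_minus."""
    return [[-1 if rng.random() < prob_minus else 1 for _ in range(cols)]
            for _ in range(rows)]

def observe(X, xi, xi_bar):
    """Compute both observation fields from the hidden field and the noise.

    Y[k-1][v]    = X_k^v * X_{k-1}^v * xi_k^v        (comparison in time)
    Ybar[k-1][v] = X_k^v * X_k^{v+1} * xi_bar_k^v    (comparison in space, periodic)
    """
    n_steps, n_sites = len(X) - 1, len(X[0])
    Y = [[X[k][v] * X[k - 1][v] * xi[k - 1][v] for v in range(n_sites)]
         for k in range(1, n_steps + 1)]
    Ybar = [[X[k][v] * X[k][(v + 1) % n_sites] * xi_bar[k - 1][v] for v in range(n_sites)]
            for k in range(1, n_steps + 1)]
    return Y, Ybar

def simulate(n_sites, n_steps, p, seed=0):
    """Sample X_0..X_{n_steps} and the noisy observations at times 1..n_steps."""
    rng = random.Random(seed)
    X = sample_signs(rng, n_steps + 1, n_sites, 0.5)   # i.i.d. fair signs
    xi = sample_signs(rng, n_steps, n_sites, p)        # noise flips with probability p
    xi_bar = sample_signs(rng, n_steps, n_sites, p)
    Y, Ybar = observe(X, xi, xi_bar)
    return X, xi, xi_bar, Y, Ybar
```

With `p = 0` the hidden state at each site can be recovered from its initial value and the time-comparison observations, exactly as in Example 7.1.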

The  model  that  we  have  constructed  is  evidently  a  direct  extension  of  Example 
7.1 to infinite dimension. As in Example 7.1, the process $(X_k,Y_k)_{k\in\mathbb{Z}}$ is ergodic in
the strongest sense, so that even the uniform stability assumption of Theorem 7.2 is
satisfied. When $p = 0$, it is easily seen by the same reasoning as in Example 7.1 that
the  filter  is  not  stable.  However,  in  Example  7.1  the  addition  of  observation  noise 
with  error  probability  p  >  0  would  yield  nondegenerate  observations,  and  thus  filter 
stability  by  Theorem  7.2.  In  the  present  setting,  on  the  other  hand,  nondegeneracy 
fails  for  any  p.  Nonetheless,  the  observations  are  locally  nondegenerate  when  p  >  0, 
and  one  might  conjecture  that  this  suffices  to  ensure  inheritance  of  ergodicity.  This 
is  not  the  case. 

Theorem 7.7 (Inheritance of stability, phase transition). For the model of this
section, there exist constants $0 < p_\star \le p^\star < 1/2$ such that the filter is stable for
$p^\star < p \le 1/2$ and is not stable for $0\le p < p_\star$.




We  refer  to  Appendix  D  for  the  proof  of  Theorem  7.7.  The  proof  relies  on  standard 
tools  from  statistical  mechanics  [7,  27]:  a  Peierls  argument  for  the  low  noise  regime 
and  a  Dobrushin  contraction  method  for  the  high  noise  regime. 

Remark 7.8. We naturally believe that one can choose $p_\star = p^\star$ in Theorem 7.7, but
we did not succeed in proving that. The proof yields some explicit bounds on $p_\star$ and
$p^\star$.


Theorem 7.7 shows that local nondegeneracy does not suffice to ensure inheritance
of ergodicity in infinite dimension: ergodicity of the filter undergoes a phase
transition at a strictly positive signal-to-noise ratio of the observations. Remarkably,
the underlying model does not seem to exhibit any qualitative change in behavior:
$(X_k^v,Y_k^v)_{k,v\in\mathbb{Z}}$ is a one-dependent random field for every value of the error probability
$p$. Thus it is evidently possible in infinite dimension that complex ergodic behavior
emerges in an otherwise trivial model when we consider its conditional distributions.


7.5  Conjecture  on  inheritance  of  stability 

Theorem  7.7  shows  that  inheritance  of  ergodicity  under  conditioning  cannot  be  taken 
for  granted  in  infinite  dimension  even  when  the  model  is  locally  nondegenerate.  Are 
such phenomena prevalent in infinite dimension, or are they restricted to some carefully
constructed examples? We would like to understand in what situations such
phenomena  can  be  ruled  out,  both  from  the  mathematical  perspective  and  in  view 
of  the  importance  of  filter  stability  (as  well  as  spatial  decay  of  correlations  in  infinite 
dimension)  for  the  performance  of  practical  filtering  algorithms,  as  seen  in  Chapter  4 
and  Chapter  5. 

It is not difficult to understand the mechanism that causes the filter to be unstable
in Theorem 7.7. In this model, the observations possess a global symmetry:
the conditional law of $Y$ is unchanged under the transformation $X\mapsto -X$. This
symmetry renders the filter trivially unstable in the absence of observation noise, in
precise analogy with Example 7.1. In the finite-dimensional case, however, Theorem
7.2 shows that the addition of any observation noise suffices to ensure that ergodicity
of the underlying model is not broken by the additional symmetry introduced
by conditioning. The surprise in infinite dimension is that the qualitative effect of
the added symmetry still persists in the presence of observation noise. Thus local
nondegeneracy in itself does not suffice to ensure the inheritance of ergodicity under
conditioning.
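
The global symmetry can be verified in one line, assuming the observation equations of Section 7.4 are the products written below (with $\bar Y$, $\bar\xi$ denoting the second observation component and its noise):

```latex
\begin{align*}
(-X_k^v)(-X_{k-1}^v)\,\xi_k^v &= X_k^v X_{k-1}^v\,\xi_k^v = Y_k^v, \\
(-X_k^v)(-X_k^{v+1})\,\bar\xi_k^v &= X_k^v X_k^{v+1}\,\bar\xi_k^v = \bar Y_k^v .
\end{align*}
```

Every observation is a product of exactly two hidden spins, so the two sign changes cancel for every $k$ and $v$ and for every value of $p$. The observations therefore cannot distinguish the configuration $X$ from $-X$.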

On  the  other  hand,  the  phenomenon  exhibited  in  Theorem  7.7  evidently  cannot 
arise  in  models  that  do  not  possess  observation  symmetries.  It  seems  natural  to 
conjecture  that  the  presence  of  such  symmetries  is  the  only  possible  obstruction  to 
inheritance of ergodicity under conditioning: that is, inheritance of ergodicity is ensured
once observation symmetries are ruled out. It is not entirely obvious, however,
how such a principle can be rigorously formulated. On the other hand, even in the
absence of a general definition, this intuitive notion should certainly be satisfied in
many elementary observation models. For example, let us state the following simple
conjecture, which encapsulates the essence of the above intuition in the simplest
possible setting.

Conjecture 7.9. Let $(X_k,Y_k)_{k\in\mathbb{Z}}$ be a stationary infinite-dimensional hidden Markov
model as in Section 7.3 with $X_k\in\{-1,1\}^{\mathbb{Z}}$ and with $Y_k\in\{-1,1\}^{\mathbb{Z}}$ of the form

\[ Y_k^v = X_k^v\,\xi_k^v, \qquad (\xi_k^v)_{k,v\in\mathbb{Z}} \text{ are i.i.d.} \perp\!\!\!\perp X \text{ with } \mathbf{P}(\xi_k^v = -1) = p. \]

If the underlying process $(X_k)_{k\in\mathbb{Z}}$ is stable, then the filter is stable.

The idea behind this conjecture is that the direct observation structure $Y_k^v = X_k^v\xi_k^v$
is evidently devoid of symmetries for any $p\ne\frac12$: every configuration $x\in\{-1,1\}^{\mathbb{Z}}$
gives rise to a distinct observation law $\mathbf{P}(Y_k\in\,\cdot \mid X_k = x)$ (the case $p=\frac12$ is trivial as
then $Y\perp\!\!\!\perp X$; we will therefore assume $p\ne\frac12$ in the sequel). Thus any mechanism of
the  type  exhibited  by  Theorem  7.7  is  ruled  out,  and  it  seems  hard  to  imagine  another 
mechanism  by  which  ergodicity  of  the  underlying  process  could  be  obstructed  due  to 
conditioning  on  such  informative  observations.  Despite  the  seemingly  obvious  nature 
of  this  conjecture,  we  were  not  able  to  prove  such  a  result  in  a  general  setting. 

The  idea  that  stability  of  the  filter  is  related  to  the  absence  of  symmetries  is  not 
new  in  the  infinite-dimensional  setting.  It  arises  already  in  classical  filtering  models 
for  a  somewhat  different  reason:  it  may  happen  that  the  filter  is  stable  even  when  the 
underlying  model  is  not  ergodic.  In  such  situations,  stability  properties  can  emerge 
under  the  conditional  distribution  due  to  the  informative  nature  of  the  observations; 
in  essence,  the  filter  will  “forget”  its  initial  distribution  as  the  information  contained 
therein  is  superseded  by  the  information  in  the  observations.  This  phenomenon  was 
made  precise  in  the  papers  [54,  56,  58].  While  the  theory  developed  in  these  papers 
is  closely  related  to  the  symmetry  breaking  properties  that  we  aim  to  exploit  here, 
these  results  are  not  satisfactory  in  infinite  dimension. 

In  the  paper  [42]  we  extend  such  observability  arguments  to  translation-invariant 
systems  in  infinite  dimension  by  exploiting  a  technique  from  multidimensional  ergodic 
theory  [12].  Somewhat  surprisingly,  the  problem  proves  to  be  more  tractable  in 
the continuous-time setting, for which we establish the validity of the natural analogue
of  Conjecture  7.9.  In  its  original  discrete  time  formulation,  however,  our  ultimate 
result  falls  short  of  establishing  Conjecture  7.9  even  for  translation-invariant  models. 
Nonetheless,  the  theory  developed  here  provides  one  possible  mechanism  for  symmetry 
breaking  in  conditional  ergodic  theory. 


7.6  Conditional  random  fields 

Thus  far  we  have  considered  infinite-dimensional  counterparts  of  classical  stability 
problems  in  nonlinear  filtering.  However,  new  questions  arise  in  infinite  dimension 
beyond  stability  that  are  of  interest  in  their  own  right.  In  particular,  for  the  theory 
developed  in  Chapter  4  and  Chapter  5  it  is  of  significant  interest  to  understand 
the  spatial  mixing  and  decay  of  correlations  properties  of  conditional  distributions 
in infinite dimension, which could be viewed as spatial counterparts to the filter
stability  property.  Such  questions  already  arise  in  the  absence  of  dynamics,  and  thus 
we  proceed  in  this  section  to  introduce  such  problems  in  the  most  basic  setting  of 
conditional  random  fields  (that  is,  in  models  with  only  spatial  degrees  of  freedom). 
Our  motivations  for  such  questions  are  threefold: 

1.  Random  fields  provide  the  simplest  possible  setting  to  investigate  the  spatial 
mixing  properties  of  conditional  distributions. 

2. Conditional random fields are of practical interest in their own right, for example,
in Bayesian image analysis applications [66, 26].

3. Even in the more classical setting of the previous sections, the random field
viewpoint proves to be fundamental to the understanding of filter stability in
infinite dimension: indeed, the proofs in Section 7.4 and in Chapter 4 and
Chapter 5 exploit the idea that $(X_k^v, Y_k^v)_{k\in\mathbb{Z},\,v\in\mathbb{Z}^d}$ can be viewed as a space-time
random field.

The remainder of this chapter is organized as follows. In Section 7.6.1, we recall
some basic notions from the theory of Markov random fields. In Section 7.6.2, we develop
basic properties of conditional random fields and introduce some of the relevant
questions.

7.6.1  Markov  random  fields 

A random field is a collection of random variables $X^v$ that are indexed by the spatial
degree of freedom $v$. For simplicity, we will assume in the sequel that $v \in \mathbb{Z}^d$ and
that each $X^v$ takes values in a finite set $E$.

In the following, we define for any $V \subseteq \mathbb{Z}^d$

$$V^c := \mathbb{Z}^d \setminus V, \qquad \partial V := \{w \in V^c : \|w - v\| = 1 \text{ for some } v \in V\}, \qquad X^V := (X^v)_{v \in V}.$$

If $V$ is a finite subset of $\mathbb{Z}^d$, we will write $V \subset\subset \mathbb{Z}^d$. We now recall a basic definition.

Definition 7.10. $X = (X^v)_{v \in \mathbb{Z}^d}$ is called a Markov random field if it possesses the
(local) Markov property, that is, $\mathbf{P}(X^V \in \cdot \mid X^{V^c})$ depends only on $X^{\partial V}$ for every
$V \subset\subset \mathbb{Z}^d$.

Just  as  Markov  chains  are  defined  by  transition  probabilities,  Markov  random 
fields  are  defined  by  a  family  of  local  transition  kernels  called  a  specification  [27, 
Chapter  1]  (cf.  Remark  6.1). 

Definition 7.11. A family $\gamma = (\gamma_V)_{V \subset\subset \mathbb{Z}^d}$ of transition kernels on $E^{\mathbb{Z}^d}$ such that

1. $\gamma_V(x, A)$ is a function of $x^{\partial V}$ for every measurable $A \subseteq E^{\mathbb{Z}^d}$ and $V \subset\subset \mathbb{Z}^d$,

2. $\gamma_V(x, A) = \mathbf{1}_A(x)$ for every $A \in \sigma\{X^{V^c}\}$ and $V \subset\subset \mathbb{Z}^d$,

3. $\gamma_V \gamma_W = \gamma_V$ for every $W \subseteq V \subset\subset \mathbb{Z}^d$,

is called a specification. A Markov random field $X$ is said to be specified by $\gamma$ if we
have $\mathbf{P}(X \in A \mid X^{V^c}) = \gamma_V(X, A)$ for every measurable set $A$ and $V \subset\subset \mathbb{Z}^d$. The
family of all laws of Markov random fields specified by $\gamma$ is denoted $\mathscr{G}(\gamma)$.

Example 7.12. Standard constructions of Markov random fields arise in statistical
mechanics in the following manner. Let $\psi^v : E \to \mathbb{R}$ and $\varphi^{\{v,w\}} : E \times E \to \mathbb{R}$ for
$v, w \in \mathbb{Z}^d$ with $\|v - w\| = 1$ be given potential functions, and let

$$\gamma_V(x, A) = \frac{1}{Z} \sum_{x^V \in E^V} \mathbf{1}_A(x) \exp\Bigg( \sum_{\{v,w\} \subseteq V \cup \partial V:\ \|v - w\| = 1} \varphi^{\{v,w\}}(x^v, x^w) + \sum_{v \in V} \psi^v(x^v) \Bigg),$$

where $Z$ is the appropriate normalization factor. It can be easily verified that $\gamma =
(\gamma_V)_{V \subset\subset \mathbb{Z}^d}$ defines a specification. The potentials $\psi^v$ and $\varphi^{\{v,w\}}$ describe the local
external and interaction forces between different sites, and are defined directly in
terms of the physical parameters of the problem. For example, if $E = \{-1, 1\}$,
$\varphi^{\{v,w\}}(\sigma, \sigma') = \beta J \sigma \sigma'$, and $\psi^v(\sigma) = \beta \mu \sigma$ with $\beta, J > 0$ and $\mu \in \mathbb{R}$, this is the well
known ferromagnetic Ising model with inverse temperature $\beta$, interaction strength $J$
and magnetic field strength $\mu$. The construction in terms of potentials will be inessential
in the sequel, however.
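To make the Ising case of Example 7.12 concrete, the following minimal sketch evaluates the kernel $\gamma_V$ for a single site $V = \{v\}$, that is, the probability that the spin at $v$ equals $+1$ given its nearest neighbors. The function name and the argument layout are our own illustrative choices, not notation from the text; the weight of a spin value $s$ is taken to be $\exp(\beta J s \sum_w x^w + \beta \mu s)$, as the potentials above prescribe.

```python
import math


def ising_single_site_prob(neighbor_spins, beta, J, mu):
    """Probability that the spin at one site is +1 given its neighbors.

    Hypothetical helper: evaluates gamma_V(x, {x^v = +1}) for V = {v}
    with the Ising potentials beta*J*s*s' (pairs) and beta*mu*s (sites).
    """
    # Local field felt by the site: interaction with neighbors plus external field.
    field = beta * (J * sum(neighbor_spins) + mu)
    w_plus = math.exp(field)    # unnormalized weight of spin +1
    w_minus = math.exp(-field)  # unnormalized weight of spin -1
    return w_plus / (w_plus + w_minus)  # the normalization Z is w_plus + w_minus


# Four aligned neighbors on Z^2, no external field.
p_up = ising_single_site_prob([1, 1, 1, 1], beta=0.5, J=1.0, mu=0.0)
```

Only the neighboring spins enter the computation, which is the first property required of a specification in this single-site case.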

Given a specification $\gamma$, there always exists a random field in $\mathscr{G}(\gamma)$ under our
assumptions. However, just as a Markov chain with given transition probabilities
may admit more than one stationary distribution, the random field associated to a
given specification need not be unique. In fact, the structure of the set $\mathscr{G}(\gamma)$ is closely
related to the spatial mixing properties of the associated random fields, as is shown
by the following result [27, section 4.4, Proposition 7.11, Theorem 7.7]. To interpret
the notion of extremality that arises here, note that if $\mathbf{P}$ and $\mathbf{Q}$ are the laws of two
random fields in $\mathscr{G}(\gamma)$, then $\lambda \mathbf{P} + (1 - \lambda)\mathbf{Q}$ is also in $\mathscr{G}(\gamma)$ for $0 < \lambda < 1$ [27, Chapter
7]; thus $\mathscr{G}(\gamma)$ is a convex set, and a random field is called extremal if it is an extreme
point of this set.

Theorem 7.13. For a given specification $\gamma$, the following hold.

1. Existence of a random field: $\mathscr{G}(\gamma) \neq \varnothing$.

2. Uniqueness $\Leftrightarrow$ uniform mixing: $|\mathscr{G}(\gamma)| = 1$ iff a random field in $\mathscr{G}(\gamma)$ satisfies$^{3,4}$

$$\lim_{W \subset\subset \mathbb{Z}^d} \sup_x \big| \mathbf{P}(X^V \in A \mid X^{W^c} = x^{W^c}) - \mathbf{P}(X^V \in A) \big| = 0$$

for every set $A$ and $V \subset\subset \mathbb{Z}^d$.

3. Extremality $\Leftrightarrow$ mixing: the random field $X$ is an extreme point of $\mathscr{G}(\gamma)$ iff

$$\lim_{W \subset\subset \mathbb{Z}^d} \mathbf{E} \big| \mathbf{P}(X^V \in A \mid X^{W^c}) - \mathbf{P}(X^V \in A) \big| = 0$$

for every set $A$ and $V \subset\subset \mathbb{Z}^d$.

3 Here we used the suggestive notation $\mathbf{P}(X \in C \mid X^{W^c} = x^{W^c}) := \gamma_W(x, C)$ to emphasize the
significance of the mixing property. Note that $\mathbf{P}(X \in C \mid X^{W^c}) = \gamma_W(X, C)$ holds a.s. by the
definition of $\mathscr{G}(\gamma)$, but the equivalence between uniqueness and uniform mixing is false if a null set
is omitted in the supremum over $x$.

4 The notation $\lim_W a_W$ denotes the limit of the net $\{a_W\}$, where $\{W \subset\subset \mathbb{Z}^d\}$ is directed by
inclusion.

The mixing property in Theorem 7.13 is a direct spatial analogue of the stability
property of a Markov chain introduced in Section 7.2. Indeed, a Markov chain is
stable if it forgets its initial condition after a long time: that is, the Markov chain
has a "finite memory." Similarly, a random field is mixing if the distribution of
any finite set of sites $V$ is insensitive to knowledge of the configuration of the field
outside a larger set $W$ when the distance between $V$ and $W^c$ is large. This implies
in particular that distant sites are nearly independent, that is, the field has "finite
correlation length." The uniform mixing property is a strictly stronger notion, where
the forgetting property holds uniformly in the boundary configuration $x^{\partial W}$ (recall
that by the Markov property of the random field, $\mathbf{P}(X \in C \mid X^{W^c} = x^{W^c})$ depends on
$x^{\partial W}$ only).


7.6.2  Conjecture  on  inheritance  of  decay  of  correlations 

In the following, let us fix a specification $\gamma$ and a Markov random field $X = (X^v)_{v \in \mathbb{Z}^d}$
that is specified by $\gamma$. In order to investigate the conditional distributions of random
fields, we must introduce a suitable observation structure. To this end, in analogy
with Section 7.3, let us fix for each $v \in \mathbb{Z}^d$ a transition kernel $\Phi^v$ from the state space
$E$ of the random field to a measurable space $F$ in which the observations take their
values. We now construct the observations $Y = (Y^v)_{v \in \mathbb{Z}^d}$ such that

$$\mathbf{P}(Y \in dy \mid X) = \prod_{v \in \mathbb{Z}^d} \Phi^v(X^v, dy^v),$$

that is, each site of the underlying field is observed independently with $\mathbf{P}(Y^v \in
A \mid X^v) = \Phi^v(X^v, A)$. The resulting model $(X^v, Y^v)_{v \in \mathbb{Z}^d}$ is called a hidden Markov
random field.

Remark 7.14. For notational simplicity, we have formulated our model such that the
observations are attached to individual sites $v \in \mathbb{Z}^d$. One could also consider more
general models, for example, where an observation $Y^{\{v,w\}}$ is attached to every edge
$\{v, w\} \subset \mathbb{Z}^d$, $\|v - w\| = 1$ with $\mathbf{P}(Y^{\{v,w\}} \in A \mid X) = \Phi^{\{v,w\}}(X^v, X^w, A)$ (cf. Example
7.17). The results of this section will continue to hold in this setting with minor
modifications.

We can now formulate the natural counterpart of the filter stability property in
hidden Markov random fields: the model is said to be conditionally mixing if the
conditional distribution of the underlying process in a finite set of sites given the
observations is insensitive to knowledge of the configuration of the field at distant
sites.

Definition 7.15. The hidden Markov random field $(X^v, Y^v)_{v \in \mathbb{Z}^d}$ is conditionally mixing
if

$$\lim_{W \subset\subset \mathbb{Z}^d} \mathbf{E} \big| \mathbf{P}(X^V \in A \mid X^{W^c}, Y) - \mathbf{P}(X^V \in A \mid Y) \big| = 0$$

for every set $A$ and $V \subset\subset \mathbb{Z}^d$.

The basic question to be addressed in this setting is therefore: when is the mixing
property inherited by conditioning, that is, when does the mixing property of the
random field $X$ imply the conditional mixing property of $(X, Y)$?

It will be insightful to reformulate the problem in different terms. For simplicity,
we will assume in the sequel that the observations are locally nondegenerate, that is,
that $\Phi^v(x^v, dy^v) = g^v(x^v, y^v)\, \varphi(dy^v)$ for some positive density $g^v(x^v, y^v) > 0$ for all
$x^v, y^v$.

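Proposition 7.16. Define for every $y \in F^{\mathbb{Z}^d}$ and $V \subset\subset \mathbb{Z}^d$ the transition kernel on
$E^{\mathbb{Z}^d}$

$$\gamma_V^y(x, A) = \frac{\int \mathbf{1}_A(z) \prod_{v \in V} g^v(z^v, y^v)\, \gamma_V(x, dz)}{\int \prod_{v \in V} g^v(z^v, y^v)\, \gamma_V(x, dz)}.$$

Then the following hold.

1. $\gamma^y = (\gamma_V^y)_{V \subset\subset \mathbb{Z}^d}$ is a specification for every $y \in F^{\mathbb{Z}^d}$.

2. $\mathbf{P}(X \in \cdot \mid Y)$ is in $\mathscr{G}(\gamma^Y)$ a.s.

3. $(X, Y)$ is conditionally mixing iff $\mathbf{P}(X \in \cdot \mid Y)$ is extremal in $\mathscr{G}(\gamma^Y)$ a.s.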

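Proof. We begin by verifying that $\gamma^y$ is a specification. To this end, let $W \subseteq V \subset\subset
\mathbb{Z}^d$. As $\gamma_V \gamma_W = \gamma_V$ and $\gamma_W(fg) = g\, \gamma_W f$ if $g(x)$ depends only on $x^{W^c}$, we can write

$$\gamma_V^y \gamma_W^y(x, A) \int \prod_{v \in V} g^v(z^v, y^v)\, \gamma_V(x, dz)$$
$$= \int \bigg[ \int \mathbf{1}_A(z) \prod_{w \in W} g^w(z^w, y^w)\, \gamma_W(z', dz) \bigg] \prod_{v \in V \setminus W} g^v(z'^v, y^v)\, \gamma_V(x, dz')$$
$$= \int \mathbf{1}_A(z) \prod_{v \in V} g^v(z^v, y^v)\, \gamma_V(x, dz).$$

Thus $\gamma_V^y \gamma_W^y = \gamma_V^y$, and the remaining properties of a specification hold trivially.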

Next, we show that $\mathbf{P}(X \in \cdot \mid Y)$ is in $\mathscr{G}(\gamma^Y)$ a.s. To this end, let us fix any
regular version $\mathbf{P}^Y$ of the conditional distribution $\mathbf{P}(\,\cdot \mid Y)$. We must show that for
a.e. observation record $y$, we have $\mathbf{P}^y(X \in A \mid X^{V^c}) = \gamma_V^y(X, A)$ for all $A$, that is, we
must show that

$$\mathbf{E}^y(\gamma_V^y(X, A) \mathbf{1}_B) = \mathbf{P}^y(\{X \in A\} \cap B) \quad \text{for every measurable } A \text{ and } B \in \sigma\{X^{V^c}\}$$

holds for $\mathbf{P}$-a.e. $y$. It is easily seen by the definition of a hidden Markov random field
that

$$\gamma_V^Y(X, A) = \mathbf{P}(X \in A \mid X^{V^c}, Y).$$

We therefore have

$$\mathbf{E}(\gamma_V^Y(X, A) \mathbf{1}_B \mathbf{1}_C) = \mathbf{P}(\{X \in A\} \cap B \cap C)$$

for every $A$ and $B \in \sigma\{X^{V^c}\}$, $C \in \sigma\{Y\}$. It follows by disintegration that

$$\mathbf{E}^Y(\gamma_V^Y(X, A) \mathbf{1}_B) = \mathbf{P}^Y(\{X \in A\} \cap B)$$

holds $\mathbf{P}$-a.s. for a fixed choice of $A$, $B \in \sigma\{X^{V^c}\}$, and thus simultaneously for a
countable family of sets $A$ and $B \in \sigma\{X^{V^c}\}$. By choosing the countable family
to be a generating class (note that all our $\sigma$-fields are countably generated), the
above identity holds simultaneously for every $A$ and $B \in \sigma\{X^{V^c}\}$ by a monotone
class argument. As there are only countably many $V \subset\subset \mathbb{Z}^d$, we have proved that
$\mathbf{P}(X \in \cdot \mid Y)$ is in $\mathscr{G}(\gamma^Y)$ a.s.

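Finally, we consider the conditional mixing property. As the limit in the definition
of (conditional) mixing is over a decreasing net (by Jensen's inequality), it suffices
to consider the limit along any fixed cofinal increasing sequence $W_n \subset\subset \mathbb{Z}^d$. Thus
by the martingale convergence theorem, the conditional mixing property holds if and
only if

$$\lim_{n \to \infty} \mathbf{E}\big( \big| \mathbf{P}(X \in A \mid X^{W_n^c}, Y) - \mathbf{P}(X \in A \mid Y) \big| \,\big|\, Y \big) = 0 \quad \text{a.s.}$$

for every $V \subset\subset \mathbb{Z}^d$ and $A \in \sigma\{X^V\}$. As we have shown that $\mathbf{P}(X \in A \mid X^{W_n^c}, Y) =
\gamma_{W_n}^Y(X, A) = \mathbf{P}^Y(X \in A \mid X^{W_n^c})$, the conditional mixing property is equivalent to

$$\lim_{n \to \infty} \mathbf{E}^y \big| \mathbf{P}^y(X \in A \mid X^{W_n^c}) - \mathbf{P}^y(X \in A) \big| = 0 \quad \text{for } \mathbf{P}\text{-a.e. } y$$

for every $V \subset\subset \mathbb{Z}^d$ and $A \in \sigma\{X^V\}$. But by the martingale convergence theorem

$$\lim_{n \to \infty} \mathbf{E}^y \big| \mathbf{P}^y(X \in A \mid X^{W_n^c}) - \mathbf{P}^y(X \in A) \big| = \mathbf{E}^y \big| \mathbf{P}^y\big(X \in A \,\big|\, \textstyle\bigcap_n \sigma\{X^{W_n^c}\}\big) - \mathbf{P}^y(X \in A) \big|.$$

Thus we can again use a monotone class argument as above to remove the dependence
of the $\mathbf{P}$-null set on $V$ and $A$. Thus $(X^v, Y^v)_{v \in \mathbb{Z}^d}$ is conditionally mixing if and only if

$$\lim_{W \subset\subset \mathbb{Z}^d} \mathbf{E}^y \big| \mathbf{P}^y(X \in A \mid X^{W^c}) - \mathbf{P}^y(X \in A) \big| = 0 \quad \text{for every } V \subset\subset \mathbb{Z}^d,\ A \in \sigma\{X^V\}$$

holds for $\mathbf{P}$-a.e. $y$, which is precisely the mixing property of $\mathbf{P}(X \in \cdot \mid Y)$. □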
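As an illustration of the kernel $\gamma_V^y$ of Proposition 7.16 in the simplest case $V = \{v\}$, the following sketch reweights the single-site Ising kernel of Example 7.12 (with no external field) by the observation density at that site. We assume, purely for illustration, the binary symmetric observation model $Y^v = X^v \xi^v$ with $\mathbf{P}(\xi^v = -1) = p$ and $0 < p < 1$, so that $g^v(x^v, y^v) = 1 - p$ if $y^v = x^v$ and $p$ otherwise; the function name and its arguments are hypothetical.

```python
import math


def conditional_single_site_prob(neighbor_spins, y_v, beta, J, p):
    """Probability that the spin at v is +1 given its neighbors and y^v.

    Hypothetical helper: evaluates gamma_V^y(x, {z^v = +1}) for V = {v},
    i.e. the prior single-site kernel reweighted by the density g^v(., y^v).
    Assumes 0 < p < 1 so that the density is strictly positive.
    """
    field = beta * J * sum(neighbor_spins)
    # Unnormalized prior weights under gamma_V(x, .) for V = {v}.
    prior = {+1: math.exp(field), -1: math.exp(-field)}
    # Observation density: the flip xi^v = -1 occurs with probability p.
    g = {s: (1.0 - p) if s == y_v else p for s in (+1, -1)}
    numerator = g[+1] * prior[+1]
    denominator = numerator + g[-1] * prior[-1]
    return numerator / denominator


# Balanced neighbors: the prior is uniform, so the observation decides.
q = conditional_single_site_prob([1, -1, 1, -1], y_v=1, beta=0.7, J=1.0, p=0.2)
```

When $p = 1/2$ the observation carries no information and the function returns the prior single-site probability, as the formula for $\gamma_V^y$ predicts.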

Proposition 7.16 shows that the conditional distribution $\mathbf{P}(X \in \cdot \mid Y)$ defines again
a (random) Markov random field, and gives an explicit expression for its specification
$\gamma^Y$. The inheritance of ergodicity can now be formulated in terms of the ergodic
properties of the conditional field. In particular, we can pose two natural questions:

1. If $\mathbf{P}(X \in \cdot\,)$ is extremal in $\mathscr{G}(\gamma)$, when is $\mathbf{P}(X \in \cdot \mid Y)$ extremal in $\mathscr{G}(\gamma^Y)$ a.s.?

2. If $|\mathscr{G}(\gamma)| = 1$, when is $|\mathscr{G}(\gamma^Y)| = 1$ a.s.?

The first question is evidently the direct spatial analogue of the filter stability problem:
when is the mixing property inherited by the conditional distribution? The
second question is analogous, but for the uniform mixing property. It is evident from
Theorem 7.13 that $|\mathscr{G}(\gamma^Y)| = 1$ a.s. implies the conditional mixing property. The
stronger conclusion $|\mathscr{G}(\gamma^Y)| = 1$ a.s. is perhaps less natural from the point of view of
conditional distributions, but is of practical relevance in its own right as it is closely
connected with the computational complexity of MCMC methods for Bayesian image
analysis [26].

As  in  the  filter  stability  problem,  local  nondegeneracy  of  the  observations  does 
not  suffice  to  obtain  an  affirmative  answer  to  either  of  the  above  questions.  In  fact, 
we  have  a  direct  analogue  of  the  example  given  in  Section  7.4. 

Example 7.17 (Inheritance of decay of correlations, phase transition). Let $E =
F = \{-1, 1\}$, and define the random field $(X^v)_{v \in \mathbb{Z}^2}$ such that $X^v$ are i.i.d. symmetric
Bernoulli random variables. It is evident that this model is uniformly mixing in the
most trivial sense (thus uniqueness and extremality both hold).

We now attach an observation $Y^{\{v,w\}}$ to each edge $\{v, w\} \subset \mathbb{Z}^d$, $\|v - w\| = 1$ by
setting $Y^{\{v,w\}} = X^v X^w \xi^{\{v,w\}}$ with $\xi^{\{v,w\}}$ i.i.d. and independent of $X$ with $\mathbf{P}(\xi^{\{v,w\}} =
-1) = p$. In this manner, we evidently obtain a direct counterpart of the model of
Section 7.4. While the observations in this model are defined on the edges rather than
on the vertices as we have done in this section, a result that is entirely analogous to
Proposition 7.16 holds in this setting (see also Remark 7.14 above and Remark 7.18
below).

We can now proceed identically as in the proof of Theorem 7.7 to show that there
exists $0 < p^* < 1/2$ such that the hidden Markov random field $(X, Y)$ fails to be
conditionally mixing for $p < p^*$. In fact, this is precisely the idea behind the proof of
Theorem 7.7 in the first place: the model $(X_k^v, Y_k^v)_{k, v \in \mathbb{Z}}$ is considered as a space-time
random field, and the problem is addressed using classical methods from statistical
mechanics.

The  present  example  could  be  considered  as  a  toy  model  in  image  analysis.  The 
underlying  field  X  represents  a  grid  of  black  or  white  pixels  of  an  image,  and  the 
observations  Y  correspond  to  noisy  measurements  of  the  gradient  of  the  image  at 
each  point.  Thus  we  see  that  the  ability  to  reconstruct  the  image  based  on  the  noisy 
gradient  information  undergoes  a  phase  transition  at  a  positive  signal-to-noise  ratio. 
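The edge-observation model of Example 7.17 is easy to simulate on a finite grid, and doing so exhibits one symmetry of the observations directly: flipping every pixel of $X$ leaves every $Y^{\{v,w\}} = X^v X^w \xi^{\{v,w\}}$ unchanged. The sketch below is only an illustration under our own assumptions (a finite $n \times n$ window of $\mathbb{Z}^2$, hypothetical function and variable names); it is not part of the argument in the text.

```python
import numpy as np


def edge_observations(x, noise_h, noise_v):
    """Return Y_{v,w} = X_v * X_w * xi_{v,w} on horizontal and vertical edges."""
    y_h = x[:, :-1] * x[:, 1:] * noise_h  # edges {v, v + (0, 1)}
    y_v = x[:-1, :] * x[1:, :] * noise_v  # edges {v, v + (1, 0)}
    return y_h, y_v


rng = np.random.default_rng(0)
n, p = 6, 0.1  # grid size and flip probability, both illustrative

# i.i.d. symmetric Bernoulli field on the grid.
x = rng.choice([-1, 1], size=(n, n))
# i.i.d. edge noise, equal to -1 with probability p.
noise_h = np.where(rng.random((n, n - 1)) < p, -1, 1)
noise_v = np.where(rng.random((n - 1, n)) < p, -1, 1)

y_h, y_v = edge_observations(x, noise_h, noise_v)
# The globally flipped image produces exactly the same observations.
y_h_flipped, y_v_flipped = edge_observations(-x, noise_h, noise_v)
```

In particular, the observations cannot distinguish an image from its negative, whatever the value of $p$.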

Remark 7.18. The use of edge observations in Example 7.17 is merely cosmetic:
the same example can be reformulated in terms of vertex observations. Indeed, let us
define the random field $(\tilde X^v, \tilde Y^v)_{v \in \mathbb{Z}^d}$ with $\tilde X^v \in \{-1, 1\}^3$ and $\tilde Y^v \in \{-1, 1\}^2$ by setting

$$\tilde X^v = (X^v, X^{v+(0,1)}, X^{v+(1,0)}), \qquad \tilde Y^v = \big(X^v X^{v+(0,1)} \xi^{\{v, v+(0,1)\}},\ X^v X^{v+(1,0)} \xi^{\{v, v+(1,0)\}}\big),$$

where $X^v$ and $\xi^{\{v,w\}}$ are as in Example 7.17. Then $\tilde X$ is still a uniformly mixing
Markov random field, the observations $\tilde Y$ are locally nondegenerate, and $\mathbf{P}(\tilde X^1 \in
\cdot \mid \tilde Y) = \mathbf{P}(X \in \cdot \mid Y)$. In particular, the above conditional phase transition arises
identically in this formulation.

In  view  of  the  above,  the  inheritance  of  mixing  properties  of  random  fields  under 
conditioning cannot be taken for granted. Just as in the filter stability problem,
however, it is natural to expect that conditional mixing will hold in the absence of
observation  symmetries.  Such  a  conjecture  is  often  implicit  in  work  on  Bayesian 
image  analysis  (cf.  [26,  p.  6]).  For  example,  we  can  formulate  the  natural  analogue 
of  Conjecture  7.9. 

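Conjecture 7.19. Let $(X^v, Y^v)_{v \in \mathbb{Z}^2}$ be a hidden Markov field with $E = F = \{-1, 1\}$
and

$$Y^v = X^v \xi^v, \qquad (\xi^v)_{v \in \mathbb{Z}^2} \text{ are i.i.d.} \perp\!\!\!\perp X \text{ with } \mathbf{P}(\xi^v = -1) = p.$$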

If  the  underlying  random  field  X  is  mixing,  then  the  model  is  conditionally  mixing. 

We  do  not  know  how  to  prove  such  a  conjecture  in  a  general  setting.  However,  in 
[42]  we  establish  the  validity  of  such  a  result  under  monotonicity  assumptions  on  the 
underlying  field.  This  provides  an  entirely  different  mechanism  for  the  inheritance  of 
ergodicity than the observability theory.

Appendix A

Block  particle  filter:  proofs 


The  goal  of  this  appendix  is  to  prove  Theorem  4.2.  We  refer  to  Section  4.5  for  an 
overview  of  the  main  ideas  in  the  proof  that  we  are  going  to  present. 

Theorem 4.2 yields a bound on $|||\pi_n^\mu - \hat\pi_n^\mu|||_J$. As

$$|||\pi_n^\mu - \hat\pi_n^\mu|||_J \le \underbrace{\|\pi_n^\mu - \tilde\pi_n^\mu\|_J}_{\text{bias}} + \underbrace{|||\tilde\pi_n^\mu - \hat\pi_n^\mu|||_J}_{\text{variance}},$$

it suffices to bound each term in this inequality. As was explained in Section 4.5.1,
the first term quantifies the bias of the block particle filter, while the second term
quantifies the variance of the random sampling. The bias term will be bounded in
Theorem A.12 below, while the variance will be bounded in Theorem A.21. The
combination of these two results immediately yields Theorem 4.2.

The Dobrushin comparison method, as discussed in Section 4.5.2, is the main
workhorse of our proof. To use this method, we must be able to bound the quantities
$C_{ij}$, $b_j$, and $D_{ij}$ that appear in the Dobrushin comparison theorem (Theorem 2.11).
We have already introduced in Section 2.3 and Section 2.4 some elementary lemmas
for this purpose. We also need the following lemma to bound $C_{ij}$.


Lemma A.1 (Minorization condition). Let $\nu, \nu', \gamma, \gamma'$ be probability measures on a
measurable space $(E, \mathcal{E})$, and let $\varepsilon > 0$ be such that $\nu(A) \ge \varepsilon \gamma(A)$ and $\nu'(A) \ge \varepsilon \gamma'(A)$
for every measurable set $A$. Then

$$|||\nu - \nu'||| \le 2(1 - \varepsilon) + \varepsilon\, |||\gamma - \gamma'|||.$$

In particular, if $\gamma = \gamma'$, then $|||\nu - \nu'||| \le 2(1 - \varepsilon)$. The same conclusion holds if the
$|||\cdot|||$-norm is replaced by the $\|\cdot\|$-norm.

Proof. As $\mu = (1 - \varepsilon)^{-1}(\nu - \varepsilon\gamma)$ and $\mu' = (1 - \varepsilon)^{-1}(\nu' - \varepsilon\gamma')$ are probability measures
and $\nu - \nu' = (1 - \varepsilon)(\mu - \mu') + \varepsilon(\gamma - \gamma')$, the result follows readily. □
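The decomposition used in the proof of Lemma A.1 can be checked numerically on a finite state space. The sketch below builds $\nu$ and $\nu'$ as mixtures, so that the minorization holds by construction, and compares both sides of the bound. We assume here that $\|\cdot\|$ is the total variation norm normalized so that two probability measures are at distance at most 2 (which is what the bound $2(1 - \varepsilon)$ suggests); the helper names are ours, and only deterministic measures are covered.

```python
import numpy as np


def tv_norm(a, b):
    """Total variation norm, assumed normalized so that ||a - b|| <= 2."""
    return float(np.abs(a - b).sum())


def random_prob(rng, size):
    """Draw a random probability vector on `size` points."""
    w = rng.random(size)
    return w / w.sum()


rng = np.random.default_rng(1)
eps, size = 0.3, 5

gamma, gamma_p = random_prob(rng, size), random_prob(rng, size)
mu, mu_p = random_prob(rng, size), random_prob(rng, size)

# Mixtures satisfy nu >= eps * gamma and nu_p >= eps * gamma_p by construction,
# which is exactly the decomposition appearing in the proof.
nu = (1 - eps) * mu + eps * gamma
nu_p = (1 - eps) * mu_p + eps * gamma_p

lhs = tv_norm(nu, nu_p)
rhs = 2 * (1 - eps) + eps * tv_norm(gamma, gamma_p)
```

Any choice of `mu`, `mu_p`, `gamma`, `gamma_p` should give `lhs <= rhs`, since the bound is just the triangle inequality applied to the decomposition.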
A.1 Local stability of the filter


The  main  goal  of  this  section  is  to  prove  a  local  stability  bound  for  the  nonlinear  filter. 
We  begin,  however,  by  introducing  a  number  of  objects  that  will  appear  several  times 
in  the  sequel. 

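For any probability measure $\mu$ on $\mathbb{X}$ and $x, z \in \mathbb{X}$, $v \in V$, we define

$$\mu^v_{x,z}(A) := \mathbf{P}^\mu\big(X_0^v \in A \,\big|\, X_0^{V \setminus \{v\}} = x^{V \setminus \{v\}},\ X_1 = z\big) = \frac{\int \mathbf{1}_A(x^v) \prod_{w \in N(v)} p^w(x, z^w)\, \mu^v_x(dx^v)}{\int \prod_{w \in N(v)} p^w(x, z^w)\, \mu^v_x(dx^v)}$$

(recall the notation $\mu^v_x := \mathbf{P}^\mu(X_0^v \in \cdot \mid X_0^{V \setminus \{v\}} = x^{V \setminus \{v\}})$ in Section 4.5.2). Let

$$C^\mu_{vv'} := \frac{1}{2} \sup_{z \in \mathbb{X}}\ \sup_{x, \tilde x \in \mathbb{X}:\ x^{V \setminus \{v'\}} = \tilde x^{V \setminus \{v'\}}} \|\mu^v_{x,z} - \mu^v_{\tilde x,z}\|$$

for $v, v' \in V$. The quantity

$$\mathrm{Corr}(\mu, \beta) := \max_{v \in V} \sum_{v' \in V} e^{\beta d(v,v')}\, C^\mu_{vv'}$$

could be viewed as a measure of the degree of correlation decay of the measure $\mu$
at rate $\beta > 0$. It will turn out that this (not entirely obvious) measure of decay
of correlations is precisely tuned to the needs of the proof of Theorem 4.2. This is
due to the fact that the measures $\mu^v_{x,z}$ arise naturally when applying the Dobrushin
comparison method to the smoothing distributions as discussed in Section 4.5.2.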

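Proposition A.2 (Local filter stability). Suppose there exists $\varepsilon > 0$ such that

$$\varepsilon \le p^v(x, z^v) \le \varepsilon^{-1} \quad \text{for all } v \in V,\ x, z \in \mathbb{X}.$$

Let $\mu, \nu$ be probability measures on $\mathbb{X}$, and suppose that

$$\mathrm{Corr}(\mu, \beta) + 3(1 - \varepsilon^{2\Delta})\, e^{2\beta r} \Delta^2 \le \frac{1}{2}$$

for a sufficiently small constant $\beta > 0$. Then we have

$$\|F_n \cdots F_{s+1}\mu - F_n \cdots F_{s+1}\nu\|_J \le 2 e^{-\beta(n-s)} \sum_{v \in J} \max_{v' \in V} e^{-\beta d(v,v')} \sup_{x, z \in \mathbb{X}} \|\mu^{v'}_{x,z} - \nu^{v'}_{x,z}\|$$

for every $J \subseteq V$ and $s < n$.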

Remark A.3. There is nothing magical about the constant 1/2 in the decay of correlations
assumption; any constant $c < 1$ would work at the expense of a constant
$1/(1 - c)$ rather than 2 in the filter stability bound. As our methods are not expected
to yield tight quantitative bounds, we have taken the liberty to fix various constants of
this sort throughout the following sections for aesthetic purposes.

Remark A.4. Note that by Lemma 2.9

$$\|\mu^v_{x,z} - \nu^v_{x,z}\| \le 2\varepsilon^{-2\Delta}\, \|\mu^v_x - \nu^v_x\|.$$

This yields a slightly cleaner bound in Proposition A.2 with a worse constant. For
our purposes, however, it will be just as easy to bound $\|\mu^v_{x,z} - \nu^v_{x,z}\|$ directly.

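Proof. Define the smoothing distributions

$$\rho = \mathbf{P}^\mu(X_0, \ldots, X_n \in \cdot \mid Y_1, \ldots, Y_n), \qquad \tilde\rho = \mathbf{P}^\nu(X_0, \ldots, X_n \in \cdot \mid Y_1, \ldots, Y_n).$$

We will apply Theorem 2.11 to $\rho, \tilde\rho$ with $I = \{0, \ldots, n\} \times V$ and $\mathbb{S} = \mathbb{X}^{n+1}$ as discussed
in Section 4.5.2. To this end, we must bound the quantities $C_{ij}$ and $b_j$. We begin by bounding
$C_{ij}$ with $i = (k, v)$ and $j = (k', v')$. We distinguish three cases.

Case $k = 0$. The key observation in this case is that $\rho^i_x = \mu^v_{x_0, x_1}$ by the Markov
property (or by direct computation). Note that as $\mathrm{card}\, N(v) \le \Delta$, we have

$$\mu^v_{x,z}(A) = \frac{\int \mathbf{1}_A(x^v) \prod_{w \in N(v)} p^w(x, z^w)\, \mu^v_x(dx^v)}{\int \prod_{w \in N(v)} p^w(x, z^w)\, \mu^v_x(dx^v)} \ge \varepsilon^{2\Delta} \mu^v_x(A),$$

so $\|\mu^v_{x,z} - \mu^v_{x,\tilde z}\| \le 2(1 - \varepsilon^{2\Delta})$ for any $z, \tilde z \in \mathbb{X}$ by Lemma A.1. Therefore

$$C_{ij} \le \begin{cases} C^\mu_{vv'} & \text{if } k' = 0, \\ 1 - \varepsilon^{2\Delta} & \text{if } k' = 1 \text{ and } v' \in N(v), \\ 0 & \text{otherwise.} \end{cases}$$

This evidently implies that

$$\sum_{(k',v') \in I} e^{\beta k'} e^{\beta d(v,v')}\, C_{(0,v)(k',v')} \le \mathrm{Corr}(\mu, \beta) + (1 - \varepsilon^{2\Delta})\, e^{\beta(r+1)} \Delta.$$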


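Case $0 < k < n$. Now we have (cf. Section 4.5.2)

$$\rho^i_x(A) = \frac{\int \mathbf{1}_A(x_k^v)\, p^v(x_{k-1}, x_k^v)\, g^v(x_k^v, Y_k^v) \prod_{w \in N(v)} p^w(x_k, x_{k+1}^w)\, \psi^v(dx_k^v)}{\int p^v(x_{k-1}, x_k^v)\, g^v(x_k^v, Y_k^v) \prod_{w \in N(v)} p^w(x_k, x_{k+1}^w)\, \psi^v(dx_k^v)}.$$

By inspection, $\rho^i_x$ does not depend on $x_{k'}^{v'}$, except in the following cases: $k' = k - 1$
and $v' \in N(v)$; $k' = k + 1$ and $v' \in N(v)$; $k' = k$ and $v' \in \bigcup_{w \in N(v)} N(w)$. As

$$\rho^i_x(A) \ge \varepsilon^{2\Delta}\, \frac{\int \mathbf{1}_A(x_k^v)\, p^v(x_{k-1}, x_k^v)\, g^v(x_k^v, Y_k^v)\, \psi^v(dx_k^v)}{\int p^v(x_{k-1}, x_k^v)\, g^v(x_k^v, Y_k^v)\, \psi^v(dx_k^v)}$$

as well as

$$\rho^i_x(A) \ge \varepsilon^{2}\, \frac{\int \mathbf{1}_A(x_k^v)\, g^v(x_k^v, Y_k^v) \prod_{w \in N(v)} p^w(x_k, x_{k+1}^w)\, \psi^v(dx_k^v)}{\int g^v(x_k^v, Y_k^v) \prod_{w \in N(v)} p^w(x_k, x_{k+1}^w)\, \psi^v(dx_k^v)},$$

we can use Lemma A.1 to estimate

$$C_{ij} \le \begin{cases} 1 - \varepsilon^{2} & \text{if } k' = k - 1 \text{ and } v' \in N(v), \\ 1 - \varepsilon^{2\Delta} & \text{if } k' = k + 1 \text{ and } v' \in N(v), \\ 1 - \varepsilon^{2\Delta} & \text{if } k' = k \text{ and } v' \in \bigcup_{w \in N(v)} N(w), \\ 0 & \text{otherwise.} \end{cases}$$

This yields

$$\sum_{(k',v') \in I} e^{\beta|k - k'|} e^{\beta d(v,v')}\, C_{(k,v)(k',v')} \le (1 - \varepsilon^{2\Delta}) \{ e^{2\beta r} \Delta^2 + 2 e^{\beta(r+1)} \Delta \} \le 3(1 - \varepsilon^{2\Delta})\, e^{2\beta r} \Delta^2,$$

where we have used that $r \ge 1$ and $\Delta \ge 1$ in the last inequality.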

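Case $k = n$. Now we have

$$\rho^i_x(A) = \frac{\int \mathbf{1}_A(x_n^v)\, p^v(x_{n-1}, x_n^v)\, g^v(x_n^v, Y_n^v)\, \psi^v(dx_n^v)}{\int p^v(x_{n-1}, x_n^v)\, g^v(x_n^v, Y_n^v)\, \psi^v(dx_n^v)} \ge \varepsilon^{2}\, \frac{\int \mathbf{1}_A(x_n^v)\, g^v(x_n^v, Y_n^v)\, \psi^v(dx_n^v)}{\int g^v(x_n^v, Y_n^v)\, \psi^v(dx_n^v)},$$

and we obtain precisely as above

$$C_{ij} \le \begin{cases} 1 - \varepsilon^{2} & \text{if } k' = n - 1 \text{ and } v' \in N(v), \\ 0 & \text{otherwise.} \end{cases}$$

We therefore find

$$\sum_{(k',v') \in I} e^{\beta|n - k'|} e^{\beta d(v,v')}\, C_{(n,v)(k',v')} \le (1 - \varepsilon^{2})\, e^{\beta(r+1)} \Delta.$$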

Combining the above three cases and the assumption of the Proposition yields

$$\max_{(k,v) \in I} \sum_{(k',v') \in I} e^{\beta\{|k - k'| + d(v,v')\}}\, C_{(k,v)(k',v')} \le \frac{1}{2}.$$

Thus Lemma 2.13 gives

$$\max_{(k,v) \in I} \sum_{(k',v') \in I} e^{\beta\{|k - k'| + d(v,v')\}}\, D_{(k,v)(k',v')} \le 2.$$

Now consider the quantities $b_j$ in Theorem 2.11. By the Markov property, it is evident
that $\rho^i_x = \tilde\rho^i_x$ whenever $i = (k, v)$ with $k \ge 1$. On the other hand, for $k = 0$ we obtain
$\rho^i_x = \mu^v_{x_0, x_1}$ and $\tilde\rho^i_x = \nu^v_{x_0, x_1}$. Applying Theorem 2.11 therefore yields

$$\|\pi_n^\mu - \pi_n^\nu\|_J = \|\rho - \tilde\rho\|_{\{n\} \times J} \le \sum_{v \in J} \sum_{v' \in V} D_{(n,v)(0,v')} \sup_{x, z \in \mathbb{X}} \|\mu^{v'}_{x,z} - \nu^{v'}_{x,z}\|.$$

However, note that

$$\sum_{v' \in V} D_{(n,v)(0,v')} \sup_{x, z \in \mathbb{X}} \|\mu^{v'}_{x,z} - \nu^{v'}_{x,z}\|$$
$$= e^{-\beta n} \sum_{v' \in V} e^{\beta\{n + d(v,v')\}}\, D_{(n,v)(0,v')}\, e^{-\beta d(v,v')} \sup_{x, z \in \mathbb{X}} \|\mu^{v'}_{x,z} - \nu^{v'}_{x,z}\|$$
$$\le 2 e^{-\beta n} \max_{v' \in V} e^{-\beta d(v,v')} \sup_{x, z \in \mathbb{X}} \|\mu^{v'}_{x,z} - \nu^{v'}_{x,z}\|,$$

using the above estimate on the matrix $D$. Substituting this into the bound for
$\|\pi_n^\mu - \pi_n^\nu\|_J$ yields the statement of the Proposition for the special case $s = 0$.

To obtain the result for any $s < n$, note that $F_n \cdots F_{s+1}\mu$ and $\pi^\mu_{n-s}$ differ only in
that a different sequence of observations ($Y_{s+1}, \ldots, Y_n$ versus $Y_1, \ldots, Y_{n-s}$) is used in
the computation of these quantities. As our bound holds uniformly in the observation
sequence, however, the general result follows immediately. □


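As a corollary of Proposition A.2, let us derive a simple filter stability statement
that illustrates the role of decay of correlations (this will not be used elsewhere).

Corollary A.5 (Filter stability). Suppose there exists $\varepsilon > 0$ such that

$$\varepsilon \le p^v(x, z^v) \le \varepsilon^{-1} \quad \text{for all } v \in V,\ x, z \in \mathbb{X},$$

and such that

$$\varepsilon > \varepsilon_0 = \Big(1 - \frac{1}{6\Delta^2}\Big)^{1/2\Delta}.$$

Then for any probability measures $\mu, \nu$ on $\mathbb{X}$ and $J \subseteq V$, $n \ge 0$, we have

$$\|\pi_n^\mu - \pi_n^\nu\|_J \le 4\, \mathrm{card}\, J\ \gamma^{n/2r},$$

where $\gamma = 6\Delta^2(1 - \varepsilon^{2\Delta}) < 1$.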

Proof. We first apply Proposition A.2 with $\mu = \delta_x$. Then $\mathrm{Corr}(\mu, \beta) = 0$ for any
$\beta > 0$. Choosing $\beta = -(2r)^{-1} \log \gamma > 0$, we find that

$$\mathrm{Corr}(\mu, \beta) + 3(1 - \varepsilon^{2\Delta})\, e^{2\beta r} \Delta^2 = \frac{1}{2},$$

so that the assumption of Proposition A.2 is satisfied. Therefore,

$$\|\pi_n^{\delta_x} - \pi_n^\nu\|_J \le 4\, \mathrm{card}\, J\ e^{-\beta n} = 4\, \mathrm{card}\, J\ \gamma^{n/2r}.$$

To obtain the result for arbitrary $\mu$, note that

$$\pi_n^\mu(A) = \mathbf{P}^\mu(X_n \in A \mid Y_1, \ldots, Y_n)$$
$$= \mathbf{E}^\mu\big(\mathbf{P}^\mu(X_n \in A \mid X_0, Y_1, \ldots, Y_n) \,\big|\, Y_1, \ldots, Y_n\big)$$
$$= \mathbf{E}^\mu\big(\pi_n^{\delta_{X_0}}(A) \,\big|\, Y_1, \ldots, Y_n\big).$$

Therefore, by Jensen's inequality,

$$\|\pi_n^\mu - \pi_n^\nu\|_J \le \mathbf{E}^\mu\big(\|\pi_n^{\delta_{X_0}} - \pi_n^\nu\|_J \,\big|\, Y_1, \ldots, Y_n\big) \le \sup_{x \in \mathbb{X}} \|\pi_n^{\delta_x} - \pi_n^\nu\|_J,$$

which yields the result. □

While Proposition A.2 requires a decay of correlations assumption on the initial
condition ($\mathrm{Corr}(\mu, \beta)$ must be sufficiently small), Corollary A.5 works for any initial
condition provided that $\varepsilon > \varepsilon_0$ is sufficiently large (which is necessary in general, see
Section 4.4.1). Thus no assumption is needed on the initial condition if we want to
show only that the filter is stable in time. On the other hand, Proposition A.2 controls
not only the stability in time, but also the spatial accumulation of error between $\mu$
and $\nu$ by virtue of the damping factor $e^{-\beta d(v,v')}$: the decay of correlations property of
the initial condition is essential to obtain this type of local control. The latter is of
central importance if we wish to obtain local error bounds for filter approximations
that are uniform in time and in the model dimension.
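To get a feeling for how restrictive the condition $\varepsilon > \varepsilon_0$ of Corollary A.5 is, the constants can be evaluated numerically. The sketch below is a direct transcription of the formulas $\varepsilon_0 = (1 - 1/(6\Delta^2))^{1/2\Delta}$, $\gamma = 6\Delta^2(1 - \varepsilon^{2\Delta})$ and of the bound $4\, \mathrm{card}\, J\, \gamma^{n/2r}$; the function names and the sample values of $\Delta$, $r$ and $\varepsilon$ are our own illustrative choices.

```python
import math


def epsilon_0(delta):
    """Threshold eps_0 = (1 - 1/(6 delta^2))^(1/(2 delta)) of Corollary A.5."""
    return (1.0 - 1.0 / (6.0 * delta ** 2)) ** (1.0 / (2.0 * delta))


def contraction_gamma(eps, delta):
    """Constant gamma = 6 delta^2 (1 - eps^(2 delta)) of Corollary A.5."""
    return 6.0 * delta ** 2 * (1.0 - eps ** (2 * delta))


def stability_bound(eps, delta, r, n, card_j):
    """Right-hand side 4 card(J) gamma^(n / 2r) of the stability estimate."""
    g = contraction_gamma(eps, delta)
    if g >= 1.0:
        raise ValueError("the bound requires eps > eps_0, i.e. gamma < 1")
    return 4.0 * card_j * g ** (n / (2.0 * r))


delta, r, eps = 3, 1, 0.999  # illustrative values only
threshold = epsilon_0(delta)
gamma = contraction_gamma(eps, delta)
beta = -math.log(gamma) / (2.0 * r)  # the choice of beta made in the proof
```

For instance, `stability_bound(eps, delta, r, n=20, card_j=2)` evaluates the bound after 20 time steps for a set $J$ of two sites. Note that the threshold approaches 1 as $\Delta$ grows.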


A.2 The block projection error

The proof of a time-uniform error bound between $\pi_n^\mu$ and $\tilde\pi_n^\mu$ requires two ingredients:
we need the filter stability property of $\pi_n^\mu$, developed in the previous section, in order
to mitigate the accumulation of approximation errors over time; and we need to
control the approximation error between $\pi_n^\mu$ and $\tilde\pi_n^\mu$ in one time step. The latter is
the purpose of this section.

We will in fact consider two separate cases. To control the total error $\|\pi_n^\mu - \tilde\pi_n^\mu\|_J$,
we need to consider the one-step error made in each time step $s = 1, \ldots, n$. For time
steps $s < n$ (for which the error is dissipated by the stability of the filter), the error
must be measured in terms of the quantities that appear in Proposition A.2: that
is, we must control $\|(\tilde F_s \nu)^v_{x,z} - (F_s \nu)^v_{x,z}\|$. On the other hand, in the last time step
$s = n$, we must control directly $\|\tilde F_n \nu - F_n \nu\|_J$. While the proofs of these cases are
quite similar, each must be considered separately in the following.

We begin by bounding the error in time steps $s < n$.

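Proposition A.6 (Block error, $s < n$). Suppose there exists $\varepsilon > 0$ such that

$$\varepsilon \le p^v(x, z^v) \le \varepsilon^{-1} \quad \text{for all } v \in V,\ x, z \in \mathbb{X}.$$

Let $\nu$ be a probability measure on $\mathbb{X}$, and suppose that

$$\mathrm{Corr}(\nu, \beta) + (1 - \varepsilon^{2})\, e^{\beta(r+1)} \Delta \le \frac{1}{2}$$

for a sufficiently small constant $\beta > 0$. Then we have

$$\sup_{x, z \in \mathbb{X}} \|(\tilde F_s \nu)^v_{x,z} - (F_s \nu)^v_{x,z}\| \le 4 e^{-\beta} (1 - \varepsilon^{2\Delta})\, e^{-\beta d(v, \partial K)}$$

for every $s \in \mathbb{N}$, $K \in \mathcal{K}$ and $v \in K$.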

This  result  makes  precise  the  idea  that  was  heuristically  expressed  in  Section  4.3: 
if  the  measure  v  possesses  the  decay  of  correlations  property,  then  the  error  at  site  v 
incurred  by  applying  the  block  filter  rather  than  the  true  filter  decays  exponentially 
in  the  distance  between  v  and  the  boundary  of  the  block  that  it  is  in. 
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Proof.  We  begin  by  writing  out  the  definitions 

=  /  Ms)  ELgy  sw)  Ysw)  u(dx o)  V’(ch) 

/  rLev^Oo,  a;®)  *?")  K^o)  ip(dx)  ’ 

(P  4)  =  /  Ms)  fW  [  /  UweK>  PW(X o> ysw)  v(dxo)\  ^(dx) 

I  YIk'cx  [  /  TlweK>  Pw(xo, xW)  9w(xw,  Ysw )  v(dxo)]  V>( dx ) 

Let  us  fix  K  £  DC,  n  £  iL  throughout  the  proof.  Then 

=  /  IaQe")  >7)  ELev^^o,  xw)  v{dx o) 

^  f  gv(x'’,Ysv)UweVPw(xo,xw)v(dxo)ipv(dx'’)  ’ 

/p  =  I  1a(xv)  gv(xv,  Yf )  Y\w&kpw{xq,  xw )  z/(dx0) 

sV)A  J  gv(xv,Yf)UweKPw(xo,xw)iy{dx0)^(dx^ 

Define  /  =  ({0}  xf)U  (l,i>)  and  §  =  X  x  Xu,  and  the  probability  measures  on  § 


P(A)  = 

f  lA(x0,  xv )  gv{xv,  Ysv )  ELev^^o,  ELejv(„)  ^(^o)  V^(^) 

/  r/)  n  wevpw(x<h xw )  nuejv(^  pu(?’ zu)  u(dxo)  i> v(dxv ) 

m  = 

J  lA(xo,  xv)  gv{xv,  Yf)  Uu  £kPw(xo,  xW)  ELejv(«)  Pu(x >  zM)  V(dxv) 
I  gv(xv,  Ysv)  UweK  Pw(x o, xW)  EEejv(<;)  Pv(x,  zU )  v( dxo ) 

Then  we  have  by  construction 


II(FW) 


V 

x,z 


IIP-P||(1,D- 


We  will  apply  Theorem  2.11  to  bound  \\p  —  p|| (i,v) -  To  this  end,  we  must  bound  Ci3 
and  bi  with  i  =  (k',v')  and  j  =  ( k",v ").  We  distinguish  two  cases. 

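Case $k'=0$. In this case we have
\[
\rho^i_{(x_0,x^v)}(A) = \frac{\int \mathbf{1}_A(x_0^{v'}) \prod_{w\in N(v')} p^w(x_0,x^w)\, \nu^{v'}_{x_0}(dx_0^{v'})}{\int \prod_{w\in N(v')} p^w(x_0,x^w)\, \nu^{v'}_{x_0}(dx_0^{v'})},
\]
\[
\tilde\rho^i_{(x_0,x^v)}(A) = \frac{\int \mathbf{1}_A(x_0^{v'}) \prod_{w\in N(v')\cap K} p^w(x_0,x^w)\, \nu^{v'}_{x_0}(dx_0^{v'})}{\int \prod_{w\in N(v')\cap K} p^w(x_0,x^w)\, \nu^{v'}_{x_0}(dx_0^{v'})}.
\]
In particular, $\rho^i_{(x_0,x^v)} = \nu^{v'}_{x_0,x}$, so $C_{ij}\le C^\nu_{v'v''}$ if $k''=0$. Moreover, as
\[
\rho^i_{(x_0,x^v)}(A) \ge \varepsilon^2\, \frac{\int \mathbf{1}_A(x_0^{v'}) \prod_{w\in N(v')\setminus\{v\}} p^w(x_0,x^w)\, \nu^{v'}_{x_0}(dx_0^{v'})}{\int \prod_{w\in N(v')\setminus\{v\}} p^w(x_0,x^w)\, \nu^{v'}_{x_0}(dx_0^{v'})},
\]
we have $C_{ij}\le 1-\varepsilon^2$ if $k''=1$ (so $v''=v$) and $v\in N(v')$ by Lemma A.1, and $C_{ij}=0$ otherwise. We therefore immediately obtain the estimate
\[
\sum_{(k'',v'')\in I} e^{\beta k''} e^{\beta d(v',v'')}\, C_{(0,v')(k'',v'')} \le \mathrm{Corr}(\nu,\beta) + (1-\varepsilon^2)\, e^{\beta(r+1)}.
\]
On the other hand, note that $\rho^i_{(x_0,x^v)} = \tilde\rho^i_{(x_0,x^v)}$ if $N(v')\subseteq K$, and that we have $\rho^i_{(x_0,x^v)} \ge \varepsilon^{2\Delta}\nu^{v'}_{x_0}$ and $\tilde\rho^i_{(x_0,x^v)} \ge \varepsilon^{2\Delta}\nu^{v'}_{x_0}$. Therefore, by Lemma A.1
\[
b_i = \sup_{(x_0,x^v)\in\mathbb{S}} \|\rho^i_{(x_0,x^v)} - \tilde\rho^i_{(x_0,x^v)}\| \le
\begin{cases}
0 & \text{for } v'\in K\setminus\partial K,\\
2(1-\varepsilon^{2\Delta}) & \text{otherwise.}
\end{cases}
\]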

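Case $k'=1$. In this case we have
\[
\rho^i_{(x_0,x^v)}(A) = \tilde\rho^i_{(x_0,x^v)}(A) = \frac{\int \mathbf{1}_A(x^v)\, g^v(x^v,Y_s^v)\, p^v(x_0,x^v) \prod_{u\in N(v)} p^u(x,z^u)\, \psi^v(dx^v)}{\int g^v(x^v,Y_s^v)\, p^v(x_0,x^v) \prod_{u\in N(v)} p^u(x,z^u)\, \psi^v(dx^v)}.
\]
Thus $b_i=0$, and estimating as above we obtain $C_{ij}\le 1-\varepsilon^2$ whenever $k''=0$ and $v''\in N(v)$, and $C_{ij}=0$ otherwise. In particular, we obtain
\[
\sum_{(k'',v'')\in I} e^{\beta|1-k''|} e^{\beta d(v,v'')}\, C_{(1,v)(k'',v'')} \le (1-\varepsilon^2)\, e^{\beta(r+1)}\Delta.
\]
Combining the above two cases and the assumption of the Proposition yields
\[
\max_{(k',v')\in I} \sum_{(k'',v'')\in I} e^{\beta\{|k'-k''|+d(v',v'')\}}\, C_{(k',v')(k'',v'')} \le \frac12.
\]
Applying Theorem 2.11 and Lemma 2.13 gives
\[
\|(\mathsf{F}_s\nu)^v_{x,z} - (\tilde{\mathsf{F}}_s\nu)^v_{x,z}\| = \|\rho-\tilde\rho\|_{(1,v)}
\le 2(1-\varepsilon^{2\Delta}) \sum_{v'\in V\setminus(K\setminus\partial K)} D_{(1,v)(0,v')}
\le 4e^{-\beta}(1-\varepsilon^{2\Delta})\, e^{-\beta d(v,\partial K)}.
\]
As the choice of $x,z\in\mathbb{X}$ was arbitrary, the proof is complete. □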

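We now use a similar argument to bound the error in time step $n$.

Proposition A.7 (Block error, $s=n$). Suppose there exists $\varepsilon>0$ such that
\[
\varepsilon \le p^v(x,z^v) \le \varepsilon^{-1} \quad\text{for all } v\in V,\ x,z\in\mathbb{X}.
\]
Let $\nu$ be a probability measure on $\mathbb{X}$, and suppose that
\[
\mathrm{Corr}(\nu,\beta) + (1-\varepsilon^2)\, e^{\beta(r+1)}\Delta \le \frac12
\]
for a sufficiently small constant $\beta>0$. Then we have
\[
\|\mathsf{F}_n\nu - \tilde{\mathsf{F}}_n\nu\|_J \le 4e^{-\beta}(1-\varepsilon^{2\Delta})\, e^{-\beta d(J,\partial K)}\, \mathop{\mathrm{card}} J
\]
for every $K\in\mathcal{K}$ and $J\subseteq K$.

Proof. Define $I=\{0,1\}\times V$ and $\mathbb{S}=\mathbb{X}^2$. Fix $K\in\mathcal{K}$, and let
\[
\rho(A) = \frac{\int \mathbf{1}_A(x_0,x_1) \prod_{v\in V} p^v(x_0,x_1^v)\, g^v(x_1^v,Y_n^v)\, \nu(dx_0)\, \psi(dx_1)}{\int \prod_{v\in V} p^v(x_0,x_1^v)\, g^v(x_1^v,Y_n^v)\, \nu(dx_0)\, \psi(dx_1)},
\]
\[
\tilde\rho(A) = \frac{\int \mathbf{1}_A(x_0,x_1) \prod_{v\in K} p^v(x_0,x_1^v) \prod_{w\in V} g^w(x_1^w,Y_n^w)\, \nu(dx_0)\, \psi(dx_1)}{\int \prod_{v\in K} p^v(x_0,x_1^v) \prod_{w\in V} g^w(x_1^w,Y_n^w)\, \nu(dx_0)\, \psi(dx_1)}.
\]
Then for any $J\subseteq K$, we have
\[
\|\mathsf{F}_n\nu - \tilde{\mathsf{F}}_n\nu\|_J = \|\rho-\tilde\rho\|_{\{1\}\times J}.
\]
We will apply Theorem 2.11 to bound $\|\rho-\tilde\rho\|_{\{1\}\times J}$. To this end, we must bound $C_{ij}$ and $b_i$ with $i=(k,v)$ and $j=(k',v')$. We distinguish two cases.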

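Case $k=0$. In this case we have
\[
\rho^i_x(A) = \frac{\int \mathbf{1}_A(x_0^v) \prod_{w\in N(v)} p^w(x_0,x_1^w)\, \nu^v_{x_0}(dx_0^v)}{\int \prod_{w\in N(v)} p^w(x_0,x_1^w)\, \nu^v_{x_0}(dx_0^v)},
\qquad
\tilde\rho^i_x(A) = \frac{\int \mathbf{1}_A(x_0^v) \prod_{w\in N(v)\cap K} p^w(x_0,x_1^w)\, \nu^v_{x_0}(dx_0^v)}{\int \prod_{w\in N(v)\cap K} p^w(x_0,x_1^w)\, \nu^v_{x_0}(dx_0^v)}.
\]
In particular, $\rho^i_x = \nu^v_{x_0,x_1}$, so $C_{ij}\le C^\nu_{vv'}$ if $k'=0$. Moreover, as
\[
\rho^i_x(A) \ge \varepsilon^2\, \frac{\int \mathbf{1}_A(x_0^v) \prod_{w\in N(v)\setminus\{v'\}} p^w(x_0,x_1^w)\, \nu^v_{x_0}(dx_0^v)}{\int \prod_{w\in N(v)\setminus\{v'\}} p^w(x_0,x_1^w)\, \nu^v_{x_0}(dx_0^v)},
\]
we have $C_{ij}\le 1-\varepsilon^2$ if $k'=1$ and $v'\in N(v)$ by Lemma A.1, and $C_{ij}=0$ otherwise. We therefore immediately obtain the estimate
\[
\sum_{(k',v')\in I} e^{\beta k'} e^{\beta d(v,v')}\, C_{(0,v)(k',v')} \le \mathrm{Corr}(\nu,\beta) + (1-\varepsilon^2)\, e^{\beta(r+1)}\Delta.
\]
On the other hand, note that $\rho^i_x = \tilde\rho^i_x$ if $N(v)\subseteq K$, and that we have $\rho^i_x \ge \varepsilon^{2\Delta}\nu^v_{x_0}$ and $\tilde\rho^i_x \ge \varepsilon^{2\Delta}\nu^v_{x_0}$. Therefore, we obtain by Lemma A.1
\[
b_i = \sup_{x\in\mathbb{S}} \|\rho^i_x - \tilde\rho^i_x\| \le
\begin{cases}
0 & \text{for } v\in K\setminus\partial K,\\
2(1-\varepsilon^{2\Delta}) & \text{otherwise.}
\end{cases}
\]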

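Case $k=1$. In this case we have
\[
\rho^i_x(A) = \frac{\int \mathbf{1}_A(x_1^v)\, p^v(x_0,x_1^v)\, g^v(x_1^v,Y_n^v)\, \psi^v(dx_1^v)}{\int p^v(x_0,x_1^v)\, g^v(x_1^v,Y_n^v)\, \psi^v(dx_1^v)},
\]
while $\tilde\rho^i_x = \rho^i_x$ if $v\in K$ and
\[
\tilde\rho^i_x(A) = \frac{\int \mathbf{1}_A(x_1^v)\, g^v(x_1^v,Y_n^v)\, \psi^v(dx_1^v)}{\int g^v(x_1^v,Y_n^v)\, \psi^v(dx_1^v)}
\]
otherwise. Thus we obtain from Lemma A.1
\[
b_i = \sup_{x\in\mathbb{S}} \|\rho^i_x - \tilde\rho^i_x\| \le
\begin{cases}
0 & \text{for } v\in K,\\
2(1-\varepsilon^{2}) & \text{otherwise.}
\end{cases}
\]
On the other hand, we can readily estimate as above
\[
\sum_{(k',v')\in I} e^{\beta|1-k'|} e^{\beta d(v,v')}\, C_{(1,v)(k',v')} \le (1-\varepsilon^2)\, e^{\beta(r+1)}\Delta.
\]
Combining the above two cases and the assumption of the Proposition yields
\[
\max_{(k,v)\in I} \sum_{(k',v')\in I} e^{\beta\{|k-k'|+d(v,v')\}}\, C_{(k,v)(k',v')} \le \frac12.
\]
Applying Theorem 2.11 and Lemma 2.13 gives
\[
\|\mathsf{F}_n\nu - \tilde{\mathsf{F}}_n\nu\|_J = \|\rho-\tilde\rho\|_{\{1\}\times J}
\le 2(1-\varepsilon^{2\Delta}) \sum_{v\in J} \Big\{ \sum_{v'\in(V\setminus K)\cup\partial K} D_{(1,v)(0,v')} + \sum_{v'\in V\setminus K} D_{(1,v)(1,v')} \Big\}
\le 4e^{-\beta}(1-\varepsilon^{2\Delta})\, e^{-\beta d(J,\partial K)}\, \mathop{\mathrm{card}} J
\]
for every $J\subseteq K$. □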


A.3 Decay of correlations of the block filter


The idea behind the block filter is that the error should decay exponentially in the block size by virtue of the decay of correlations property. While we have developed above the two ingredients (filter stability and one-step error bound) required to obtain a time-uniform error bound between $\pi_n^\mu$ and $\tilde\pi_n^\mu$, we have done this by imposing the decay of correlations property as an assumption. Thus perhaps the crucial point remains to be proved: we must show that decay of correlations does indeed hold, that is, that $\mathrm{Corr}(\tilde\pi_n^\mu,\beta)$ can be controlled uniformly in time. This is the goal of the present section.

Unfortunately, $\mathrm{Corr}(\tilde\pi_n^\mu,\beta)$ is not straightforward to control directly. We therefore introduce an alternative measure of correlation decay that will be easier to control. For any probability measure $\mu$ on $\mathbb{X}$ and $x,z\in\mathbb{X}$, $v\in V$, $K\in\mathcal{K}$, let
\[
\mu^{v,K}_{x,z}(A) := \mathbf{P}^\mu\big[X_0^v\in A \,\big|\, X_0^{V\setminus\{v\}} = x^{V\setminus\{v\}},\ X_1^K = z^K\big]
= \frac{\int \mathbf{1}_A(x^v) \prod_{w\in N(v)\cap K} p^w(x,z^w)\, \mu^v_x(dx^v)}{\int \prod_{w\in N(v)\cap K} p^w(x,z^w)\, \mu^v_x(dx^v)}.
\]
We now define
\[
\tilde C^\mu_{vv'} := \frac12 \max_{K\in\mathcal{K}} \sup_{z\in\mathbb{X}} \sup_{x,\tilde x\in\mathbb{X}:\ x^{V\setminus\{v'\}} = \tilde x^{V\setminus\{v'\}}} \|\mu^{v,K}_{x,z} - \mu^{v,K}_{\tilde x,z}\|
\]
for $v,v'\in V$. The quantity
\[
\widetilde{\mathrm{Corr}}(\mu,\beta) := \max_{v\in V} \sum_{v'\in V} e^{\beta d(v,v')}\, \tilde C^\mu_{vv'}
\]
is a measure of correlation decay that is well adapted to the block filter. In order for this quantity to be useful, we must first show that it controls $\mathrm{Corr}(\mu,\beta)$.

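Lemma A.8. For any probability measure $\mu$ and $\beta>0$, we have
\[
\mathrm{Corr}(\mu,\beta) \le (1-\varepsilon^{2\Delta})\, e^{2\beta r}\Delta^2 + 2\varepsilon^{-2\Delta}\, \widetilde{\mathrm{Corr}}(\mu,\beta).
\]

Proof. By definition
\[
\mu^v_{x,z}(A) = \frac{\int \mathbf{1}_A(x^v) \prod_{w\in N(v)\setminus K} p^w(x,z^w)\, \mu^{v,K}_{x,z}(dx^v)}{\int \prod_{w\in N(v)\setminus K} p^w(x,z^w)\, \mu^{v,K}_{x,z}(dx^v)}.
\]
Let $x,\tilde x\in\mathbb{X}$ be such that $x^{V\setminus\{v'\}} = \tilde x^{V\setminus\{v'\}}$. If $v'\notin\bigcup_{w\in N(v)} N(w)$, then
\[
\|\mu^v_{x,z} - \mu^v_{\tilde x,z}\| \le 2\varepsilon^{-2\Delta}\, \|\mu^{v,K}_{x,z} - \mu^{v,K}_{\tilde x,z}\|
\]
by Lemma 2.9. On the other hand, note that
\[
\mu^v_{x,z}(A) \ge \varepsilon^{2\Delta}\mu^{v,K}_{x,z}(A), \qquad \mu^v_{\tilde x,z}(A) \ge \varepsilon^{2\Delta}\mu^{v,K}_{\tilde x,z}(A).
\]
We can therefore estimate using Lemma A.1 for $v'\in\bigcup_{w\in N(v)} N(w)$
\[
\|\mu^v_{x,z} - \mu^v_{\tilde x,z}\| \le 2(1-\varepsilon^{2\Delta}) + \varepsilon^{2\Delta}\, \|\mu^{v,K}_{x,z} - \mu^{v,K}_{\tilde x,z}\|.
\]
Thus we obtain
\[
\mathrm{Corr}(\mu,\beta) \le (1-\varepsilon^{2\Delta}) \max_{v\in V} \sum_{v'\in\bigcup_{w\in N(v)} N(w)} e^{\beta d(v,v')} + 2\varepsilon^{-2\Delta}\, \widetilde{\mathrm{Corr}}(\mu,\beta)
\le (1-\varepsilon^{2\Delta})\, e^{2\beta r}\Delta^2 + 2\varepsilon^{-2\Delta}\, \widetilde{\mathrm{Corr}}(\mu,\beta).
\]
As $\mu$ and $\beta$ were arbitrary, the proof is complete. □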

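We now aim to establish a time-uniform bound on $\widetilde{\mathrm{Corr}}(\tilde\pi_n^\mu,\beta)$. To this end, we first prove a one-step bound which will subsequently be iterated.

Proposition A.9. Suppose there exists $\varepsilon>0$ such that
\[
\varepsilon \le p^v(x,z^v) \le \varepsilon^{-1} \quad\text{for all } v\in V,\ x,z\in\mathbb{X}.
\]
Let $\nu$ be a probability measure on $\mathbb{X}$, and suppose that
\[
\widetilde{\mathrm{Corr}}(\nu,\beta) + (1-\varepsilon^2)\, e^{\beta(r+1)}\Delta \le \frac12
\]
for a sufficiently small constant $\beta>0$. Then we have
\[
\widetilde{\mathrm{Corr}}(\tilde{\mathsf{F}}_s\nu,\beta) \le 2(1-\varepsilon^{2\Delta})\, e^{2\beta r}\Delta^2
\]
for any $s\in\mathbb{N}$.

Proof. Let $K,K'\in\mathcal{K}$, $v\in K$, $v'\in V$ ($v'\ne v$), and let $z,x,\tilde x\in\mathbb{X}$ be such that $x^{V\setminus\{v'\}} = \tilde x^{V\setminus\{v'\}}$. These choices will be fixed until further notice.

Define $I = (\{0\}\times V)\cup(1,v)$ and $\mathbb{S} = \mathbb{X}\times\mathbb{X}^v$, and let
\[
\rho(A) = \frac{\int \mathbf{1}_A(x_0,x^v)\, g^v(x^v,Y_s^v) \prod_{w\in K} p^w(x_0,x^w) \prod_{u\in N(v)\cap K'} p^u(x,z^u)\, \nu(dx_0)\, \psi^v(dx^v)}{\int g^v(x^v,Y_s^v) \prod_{w\in K} p^w(x_0,x^w) \prod_{u\in N(v)\cap K'} p^u(x,z^u)\, \nu(dx_0)\, \psi^v(dx^v)},
\]
\[
\tilde\rho(A) = \frac{\int \mathbf{1}_A(x_0,x^v)\, g^v(x^v,Y_s^v) \prod_{w\in K} p^w(x_0,\tilde x^w) \prod_{u\in N(v)\cap K'} p^u(\tilde x,z^u)\, \nu(dx_0)\, \psi^v(dx^v)}{\int g^v(x^v,Y_s^v) \prod_{w\in K} p^w(x_0,\tilde x^w) \prod_{u\in N(v)\cap K'} p^u(\tilde x,z^u)\, \nu(dx_0)\, \psi^v(dx^v)}.
\]
Then we have by construction
\[
\|(\tilde{\mathsf{F}}_s\nu)^{v,K'}_{x,z} - (\tilde{\mathsf{F}}_s\nu)^{v,K'}_{\tilde x,z}\| = \|\rho-\tilde\rho\|_{(1,v)}.
\]
We will apply Theorem 2.11 to bound $\|\rho-\tilde\rho\|_{(1,v)}$. To this end, we must bound $C_{ij}$ and $b_i$ with $i=(k,t)$ and $j=(k',t')$. We distinguish two cases.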

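Case $k=0$. In this case we have
\[
\rho^i_{(x_0,x^v)}(A) = \frac{\int \mathbf{1}_A(x_0^t) \prod_{w\in N(t)\cap K} p^w(x_0,x^w)\, \nu^t_{x_0}(dx_0^t)}{\int \prod_{w\in N(t)\cap K} p^w(x_0,x^w)\, \nu^t_{x_0}(dx_0^t)},
\qquad
\tilde\rho^i_{(x_0,x^v)}(A) = \frac{\int \mathbf{1}_A(x_0^t) \prod_{w\in N(t)\cap K} p^w(x_0,\tilde x^w)\, \nu^t_{x_0}(dx_0^t)}{\int \prod_{w\in N(t)\cap K} p^w(x_0,\tilde x^w)\, \nu^t_{x_0}(dx_0^t)}.
\]
Note that $\rho^i_{(x_0,x^v)} = \nu^{t,K}_{x_0,x}$. We therefore have $C_{ij}\le\tilde C^\nu_{tt'}$ when $k'=0$. Moreover,
\[
\rho^i_{(x_0,x^v)}(A) \ge \varepsilon^2\, \frac{\int \mathbf{1}_A(x_0^t) \prod_{w\in N(t)\cap(K\setminus\{v\})} p^w(x_0,x^w)\, \nu^t_{x_0}(dx_0^t)}{\int \prod_{w\in N(t)\cap(K\setminus\{v\})} p^w(x_0,x^w)\, \nu^t_{x_0}(dx_0^t)}
\]
implies $C_{ij}\le 1-\varepsilon^2$ if $k'=1$ and $v\in N(t)$ by Lemma A.1, and $C_{ij}=0$ otherwise. On the other hand, note that as $x^{V\setminus\{v'\}} = \tilde x^{V\setminus\{v'\}}$ we have $\rho^i_{(x_0,x^v)} = \tilde\rho^i_{(x_0,x^v)}$ if $v'\notin N(t)\cap K$, while both $\rho^i_{(x_0,x^v)}(A)$ and $\tilde\rho^i_{(x_0,x^v)}(A)$ dominate
\[
\varepsilon^2\, \frac{\int \mathbf{1}_A(x_0^t) \prod_{w\in N(t)\cap(K\setminus\{v'\})} p^w(x_0,x^w)\, \nu^t_{x_0}(dx_0^t)}{\int \prod_{w\in N(t)\cap(K\setminus\{v'\})} p^w(x_0,x^w)\, \nu^t_{x_0}(dx_0^t)}.
\]
Therefore, by Lemma A.1
\[
b_{(0,t)} \le
\begin{cases}
0 & \text{for } v'\notin N(t)\cap K,\\
2(1-\varepsilon^{2}) & \text{otherwise.}
\end{cases}
\]
Case $k=1$. In this case we have
\[
\rho^i_{(x_0,x^v)}(A) = \frac{\int \mathbf{1}_A(x^v)\, g^v(x^v,Y_s^v)\, p^v(x_0,x^v) \prod_{u\in N(v)\cap K'} p^u(x,z^u)\, \psi^v(dx^v)}{\int g^v(x^v,Y_s^v)\, p^v(x_0,x^v) \prod_{u\in N(v)\cap K'} p^u(x,z^u)\, \psi^v(dx^v)},
\]
\[
\tilde\rho^i_{(x_0,x^v)}(A) = \frac{\int \mathbf{1}_A(x^v)\, g^v(x^v,Y_s^v)\, p^v(x_0,x^v) \prod_{u\in N(v)\cap K'} p^u(\tilde x,z^u)\, \psi^v(dx^v)}{\int g^v(x^v,Y_s^v)\, p^v(x_0,x^v) \prod_{u\in N(v)\cap K'} p^u(\tilde x,z^u)\, \psi^v(dx^v)}.
\]
Estimating as above, we obtain $C_{ij}\le 1-\varepsilon^2$ whenever $k'=0$ and $t'\in N(v)$, and $C_{ij}=0$ otherwise. Similarly, arguing again as above, we obtain
\[
b_{(1,v)} \le
\begin{cases}
0 & \text{for } v'\notin\bigcup_{w\in N(v)\cap K'} N(w),\\
2(1-\varepsilon^{2\Delta}) & \text{otherwise.}
\end{cases}
\]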

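Define the matrix $\{C_{ij}(v)\}_{i,j\in I}$ with the following entries:
\[
C_{(0,t)(0,t')}(v) = \tilde C^\nu_{tt'}, \qquad
C_{(0,t)(1,v)}(v) = C_{(1,v)(0,t)}(v) = (1-\varepsilon^2)\,\mathbf{1}_{t\in N(v)}, \qquad
C_{(1,v)(1,v)}(v) = 0.
\]
Combining the above two cases yields $C_{ij}\le C_{ij}(v)$, and we readily compute
\[
\sum_{(k',t')\in I} e^{\beta\{|k-k'|+d(t,t')\}}\, C_{(k,t)(k',t')}(v) \le \widetilde{\mathrm{Corr}}(\nu,\beta) + (1-\varepsilon^2)\, e^{\beta(r+1)}\Delta \le \frac12,
\]
where we have used the assumption of the Proposition. By Theorem 2.11
\[
\|(\tilde{\mathsf{F}}_s\nu)^{v,K'}_{x,z} - (\tilde{\mathsf{F}}_s\nu)^{v,K'}_{\tilde x,z}\| = \|\rho-\tilde\rho\|_{(1,v)}
\le 2(1-\varepsilon^{2})\,\mathbf{1}_{v'\in K} \sum_{t'\in N(v')} D_{(1,v)(0,t')}(v)
+ 2(1-\varepsilon^{2\Delta})\,\mathbf{1}_{v'\in\bigcup_{w\in N(v)\cap K'} N(w)}\, D_{(1,v)(1,v)}(v),
\]
where $D(v) := \sum_{n\ge 0} C(v)^n$. But note that the right-hand side does not depend on $K'$ or $z,x,\tilde x$ (provided $x^{V\setminus\{v'\}} = \tilde x^{V\setminus\{v'\}}$). We therefore obtain
\[
\tilde C^{\tilde{\mathsf{F}}_s\nu}_{vv'} \le (1-\varepsilon^{2})\,\mathbf{1}_{v'\in K} \sum_{t'\in N(v')} D_{(1,v)(0,t')}(v)
+ (1-\varepsilon^{2\Delta})\,\mathbf{1}_{v'\in\bigcup_{w\in N(v)\cap K'} N(w)}\, D_{(1,v)(1,v)}(v)
\]
for every $K\in\mathcal{K}$, $v\in K$, and $v'\in V$.

To proceed, we note that
\[
\sum_{v'\in V} e^{\beta d(v,v')}\, \tilde C^{\tilde{\mathsf{F}}_s\nu}_{vv'}
\le (1-\varepsilon^{2}) \sum_{v'\in K} \sum_{t'\in N(v')} e^{\beta d(v,v')}\, D_{(1,v)(0,t')}(v)
+ (1-\varepsilon^{2\Delta})\, D_{(1,v)(1,v)}(v) \sum_{v'\in\bigcup_{w\in N(v)\cap K'} N(w)} e^{\beta d(v,v')}
\]
\[
\le (1-\varepsilon^{2\Delta})\, e^{2\beta r}\Delta^2 \sum_{(k',t')\in I} e^{\beta\{|1-k'|+d(v,t')\}}\, D_{(1,v)(k',t')}(v).
\]
Applying Lemma 2.13 to $C(v)$ yields the result. □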

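We now iterate the above result.

Corollary A.10. Suppose there exists $\varepsilon>0$ such that
\[
\varepsilon \le p^v(x,z^v) \le \varepsilon^{-1} \quad\text{for all } v\in V,\ x,z\in\mathbb{X},
\]
and such that
\[
\varepsilon > \varepsilon_0 = \Big(1-\frac{1}{16\Delta^2}\Big)^{1/2\Delta}.
\]
Let $\mu$ be a probability measure on $\mathbb{X}$ such that
\[
\widetilde{\mathrm{Corr}}(\mu,\beta) \le \frac18,
\]
where $\beta = -(2r)^{-1}\log 16\Delta^2(1-\varepsilon^{2\Delta}) > 0$. Then
\[
\widetilde{\mathrm{Corr}}(\tilde\pi_n^\mu,\beta) \le \frac18 \quad\text{for all } n\ge 0.
\]
In particular, the latter holds whenever $\mu=\delta_x$ for any $x\in\mathbb{X}$.

Proof. The assumption $\varepsilon>\varepsilon_0$ implies $\beta>0$ and
\[
(1-\varepsilon^2)\, e^{\beta(r+1)}\Delta \le \frac{1}{16}.
\]
Therefore, if $\widetilde{\mathrm{Corr}}(\nu,\beta)\le 1/8$, then Proposition A.9 yields
\[
\widetilde{\mathrm{Corr}}(\tilde{\mathsf{F}}_s\nu,\beta) \le 2(1-\varepsilon^{2\Delta})\, e^{2\beta r}\Delta^2 \le \frac18.
\]
Thus if $\widetilde{\mathrm{Corr}}(\mu,\beta)\le 1/8$, then $\widetilde{\mathrm{Corr}}(\tilde\pi_n^\mu,\beta)\le 1/8$ for all $n\ge 0$. Moreover, as $\widetilde{\mathrm{Corr}}(\delta_x,\beta)=0$, the result holds automatically for $\mu=\delta_x$. □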
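
The constants in Corollary A.10 are explicit, so they can be evaluated for a given model. The following is a minimal numerical sketch, not part of the original argument: the function name `decay_constants` and the sample values of $\Delta$, $r$ and $\varepsilon$ are illustrative assumptions. It computes $\varepsilon_0$ and $\beta$ and checks the two inequalities used in the proof.

```python
import math

def decay_constants(delta, r, eps, c=16):
    """Return (eps0, beta) for neighborhood size `delta`, range `r`, and
    mixing constant `eps`, with c = 16 as in Corollary A.10.
    The function name and argument names are hypothetical."""
    eps0 = (1 - 1 / (c * delta**2)) ** (1 / (2 * delta))
    beta = -math.log(c * delta**2 * (1 - eps ** (2 * delta))) / (2 * r)
    return eps0, beta

# Illustrative values only (assumptions, not taken from the text).
delta, r, eps = 3, 1, 0.999
eps0, beta = decay_constants(delta, r, eps)

# Quantity bounded by 1/16 in the proof of Corollary A.10.
one_step = (1 - eps**2) * math.exp(beta * (r + 1)) * delta
# Right-hand side of Proposition A.9; equals 1/8 by the choice of beta.
contraction = 2 * (1 - eps ** (2 * delta)) * math.exp(2 * beta * r) * delta**2

print(f"eps0 = {eps0:.6f}, beta = {beta:.6f}")
print(f"one-step term = {one_step:.6f}, contraction term = {contraction:.6f}")
```

Note how close to $1$ the threshold $\varepsilon_0$ already is for $\Delta=3$: the requirement $\varepsilon>\varepsilon_0$ is a strong mixing assumption.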

We finally obtain the requisite bound on $\mathrm{Corr}(\tilde\pi_n^x,\beta)$ using Lemma A.8.

Corollary A.11 (Decay of correlations). Suppose there exists $\varepsilon>0$ with
\[
\varepsilon \le p^v(x,z^v) \le \varepsilon^{-1} \quad\text{for all } v\in V,\ x,z\in\mathbb{X},
\]
such that
\[
\varepsilon > \varepsilon_0 = \Big(1-\frac{1}{16\Delta^2}\Big)^{1/2\Delta}.
\]
Let $\beta = -(2r)^{-1}\log 16\Delta^2(1-\varepsilon^{2\Delta}) > 0$. Then
\[
\mathrm{Corr}(\tilde\pi_n^x,\beta) \le \frac13
\]
for every $n\ge 0$ and $x\in\mathbb{X}$.

Proof. By Corollary A.10 and Lemma A.8, we can estimate
\[
\mathrm{Corr}(\tilde\pi_n^x,\beta) \le \frac{1}{16} + \frac14\,\varepsilon^{-2\Delta} \le \frac13,
\]
where we used that $\varepsilon^{2\Delta} > 1 - 1/16$. □


A.4 Bounding the bias

In the previous sections, we have proved a local filter stability bound (Proposition A.2), a local one-step error bound (Propositions A.6 and A.7), and decay of correlations of the block filter (Corollary A.11). We can now combine these results to obtain a time-uniform error bound between the filter and the block filter; this controls the bias of the block particle filtering algorithm.

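Theorem A.12 (Bias term). Suppose there exists $\varepsilon>0$ such that
\[
\varepsilon \le p^v(x,z^v) \le \varepsilon^{-1} \quad\text{for all } v\in V,\ x,z\in\mathbb{X},
\]
and such that
\[
\varepsilon > \varepsilon_0 = \Big(1-\frac{1}{18\Delta^2}\Big)^{1/2\Delta}.
\]
Let $\beta = -(2r)^{-1}\log 18\Delta^2(1-\varepsilon^{2\Delta}) > 0$. Then
\[
\|\pi_n^x - \tilde\pi_n^x\|_J \le \frac{8e^{-\beta}}{1-e^{-\beta}}\, (1-\varepsilon^{2\Delta})\, \mathop{\mathrm{card}} J\; e^{-\beta d(J,\partial K)}
\]
for every $n\ge 0$, $x\in\mathbb{X}$, $K\in\mathcal{K}$ and $J\subseteq K$.

Proof. We begin with the elementary error decomposition
\[
\|\pi_n^x - \tilde\pi_n^x\|_J \le \sum_{s=1}^n \|\mathsf{F}_n\cdots\mathsf{F}_{s+1}\mathsf{F}_s\tilde\pi_{s-1}^x - \mathsf{F}_n\cdots\mathsf{F}_{s+1}\tilde{\mathsf{F}}_s\tilde\pi_{s-1}^x\|_J.
\]
We will bound each term in the sum.

Case $s=n$. To bound this term, note that
\[
\mathrm{Corr}(\tilde\pi_{n-1}^x,\beta) + (1-\varepsilon^2)\, e^{\beta(r+1)}\Delta \le \frac13 + \frac{1}{18} \le \frac12
\]
by Corollary A.11. Therefore, applying Proposition A.7 with $\nu=\tilde\pi_{n-1}^x$, we obtain
\[
\|\mathsf{F}_n\tilde\pi_{n-1}^x - \tilde{\mathsf{F}}_n\tilde\pi_{n-1}^x\|_J \le 4e^{-\beta}(1-\varepsilon^{2\Delta})\, e^{-\beta d(J,\partial K)}\, \mathop{\mathrm{card}} J.
\]
Case $s<n$. To bound this term, note that by Corollary A.11
\[
\mathrm{Corr}(\tilde\pi_s^x,\beta) + 3(1-\varepsilon^{2\Delta})\, e^{2\beta r}\Delta^2 \le \frac13 + \frac16 = \frac12.
\]
Applying Proposition A.2 with $\mu=\tilde\pi_s^x$ and $\nu=\mathsf{F}_s\tilde\pi_{s-1}^x$ yields
\[
\|\mathsf{F}_n\cdots\mathsf{F}_{s+1}\mathsf{F}_s\tilde\pi_{s-1}^x - \mathsf{F}_n\cdots\mathsf{F}_{s+1}\tilde{\mathsf{F}}_s\tilde\pi_{s-1}^x\|_J
\le 2e^{-\beta(n-s)} \sum_{v\in J} \max_{v'\in V} e^{-\beta d(v,v')} \sup_{x,z\in\mathbb{X}} \|(\mathsf{F}_s\tilde\pi_{s-1}^x)^{v'}_{x,z} - (\tilde{\mathsf{F}}_s\tilde\pi_{s-1}^x)^{v'}_{x,z}\|.
\]
On the other hand, as by Corollary A.11
\[
\mathrm{Corr}(\tilde\pi_{s-1}^x,\beta) + (1-\varepsilon^2)\, e^{\beta(r+1)}\Delta \le \frac13 + \frac{1}{18} \le \frac12,
\]
we have by Proposition A.6 with $\nu=\tilde\pi_{s-1}^x$
\[
\sup_{x,z\in\mathbb{X}} \|(\mathsf{F}_s\tilde\pi_{s-1}^x)^{v'}_{x,z} - (\tilde{\mathsf{F}}_s\tilde\pi_{s-1}^x)^{v'}_{x,z}\| \le 4e^{-\beta}(1-\varepsilon^{2\Delta})\, e^{-\beta d(v',\partial K)}.
\]
We therefore obtain the estimate
\[
\|\mathsf{F}_n\cdots\mathsf{F}_{s+1}\mathsf{F}_s\tilde\pi_{s-1}^x - \mathsf{F}_n\cdots\mathsf{F}_{s+1}\tilde{\mathsf{F}}_s\tilde\pi_{s-1}^x\|_J
\le 8e^{-\beta}(1-\varepsilon^{2\Delta})\, e^{-\beta(n-s)}\, e^{-\beta d(J,\partial K)}\, \mathop{\mathrm{card}} J,
\]
where we have used $d(v,v') + d(v',\partial K) \ge d(v,\partial K)$.

Substituting the above two cases into the error decomposition and summing the geometric series yields the statement of the Theorem. □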
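
For completeness, the final summation can be written out as follows. This is a worked version of the last step, using only the two case bounds from the proof, together with the observation that the $s=n$ bound (prefactor $4$) is dominated by the $k=0$ term of the geometric series (prefactor $8$).

```latex
\begin{align*}
\|\pi_n^x - \tilde\pi_n^x\|_J
&\le 4e^{-\beta}(1-\varepsilon^{2\Delta})\, e^{-\beta d(J,\partial K)} \operatorname{card} J
   + \sum_{s=1}^{n-1} 8e^{-\beta}(1-\varepsilon^{2\Delta})\, e^{-\beta(n-s)}\, e^{-\beta d(J,\partial K)} \operatorname{card} J \\
&\le 8e^{-\beta}(1-\varepsilon^{2\Delta})\, e^{-\beta d(J,\partial K)} \operatorname{card} J \sum_{k=0}^{\infty} e^{-\beta k} \\
&= \frac{8e^{-\beta}}{1-e^{-\beta}}\, (1-\varepsilon^{2\Delta})\, \operatorname{card} J\; e^{-\beta d(J,\partial K)} .
\end{align*}
```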


A.5 Local stability of the block filter

As was explained in Section 4.5.4, the chief difficulty in obtaining a time-uniform bound on the variance term is to establish stability of the block filter. This will be done in the present section.

We first establish a stability bound for nonrandom initial conditions.

Proposition A.13. Suppose there exists $\varepsilon>0$ such that
\[
\varepsilon \le p^v(x,z^v) \le \varepsilon^{-1} \quad\text{for all } v\in V,\ x,z\in\mathbb{X},
\]
and such that
\[
\varepsilon > \varepsilon_0 = \Big(1-\frac{1}{6\Delta^2}\Big)^{1/2\Delta}.
\]
Let $\beta = -\log 6\Delta^2(1-\varepsilon^{2\Delta}) > 0$. Then
\[
\|\tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_{s+1}\delta_z - \tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_{s+1}\delta_{z'}\|_J \le 4\,\mathop{\mathrm{card}} J\; e^{-\beta(n-s)}
\]
for every $s<n$, $z,z'\in\mathbb{X}$, $K\in\mathcal{K}$, and $J\subseteq K$.

Proof. Fix throughout the proof $n>0$, $K\in\mathcal{K}$, and $J\subseteq K$. We will also assume throughout the proof for notational simplicity that $s=0$ (the ultimate conclusion will extend to any $s<n$ as in the proof of Proposition A.2).

We begin by constructing the computation tree as explained in Section 4.5.4. For future reference, let us work first in the more general setting where the initial distributions $\mu = \bigotimes_{K'\in\mathcal{K}}\mu^{K'}$ and $\nu = \bigotimes_{K'\in\mathcal{K}}\nu^{K'}$ are independent across the blocks (rather than the special case of point masses $\delta_z$ and $\delta_{z'}$). Define for $K'\in\mathcal{K}$
\[
N(K') = \{K''\in\mathcal{K} : d(K',K'')\le r\},
\]
that is, $N(K')$ is the collection of blocks that interact with block $K'$ in one step of the dynamics (recall that $\mathop{\mathrm{card}} N(K')\le\Delta_{\mathcal{K}}$). Then we can evidently write
\[
\mathsf{B}^{K'}\tilde{\mathsf{F}}_s\mu = \mathsf{C}_s^{K'}\mathsf{P}^{K'} \bigotimes_{K''\in N(K')} \mu^{K''},
\]

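where we have defined for any probability $\eta$ on $\mathbb{X}^{K'}$
\[
(\mathsf{C}_s^{K'}\eta)(A) := \frac{\int \mathbf{1}_A(x^{K'}) \prod_{v\in K'} g^v(x^v,Y_s^v)\, \eta(dx^{K'})}{\int \prod_{v\in K'} g^v(x^v,Y_s^v)\, \eta(dx^{K'})},
\]
and for any probability $\eta$ on $\mathbb{X}^{\bigcup_{K''\in N(K')} K''}$
\[
(\mathsf{P}^{K'}\eta)(A) := \int \mathbf{1}_A(x^{K'}) \prod_{v\in K'} p^v(z,x^v)\, \psi^v(dx^v)\, \eta(dz).
\]
We therefore have
\[
\mathsf{B}^K\tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_1\mu =
\mathsf{C}_n^K\mathsf{P}^K \bigotimes_{K_{n-1}\in N(K)} \Big[ \mathsf{C}_{n-1}^{K_{n-1}}\mathsf{P}^{K_{n-1}} \bigotimes_{K_{n-2}\in N(K_{n-1})} \Big[ \mathsf{C}_{n-2}^{K_{n-2}}\mathsf{P}^{K_{n-2}} \cdots \Big[ \mathsf{C}_1^{K_1}\mathsf{P}^{K_1} \bigotimes_{K_0\in N(K_1)} \mu^{K_0} \Big] \cdots \Big] \Big].
\]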

The structure of the computation tree is now readily visible in this expression. To formalize the construction, we introduce the tree index set
\[
T := \{[K_u\cdots K_{n-1}] : 0\le u<n,\ K_s\in N(K_{s+1}) \text{ for } u\le s<n\} \cup \{[\varnothing]\},
\]
where we write $K_n := K$ for simplicity (recall that $K$ and $n$ are fixed throughout). The root of the tree $[\varnothing]$ represents the block $K$ at time $n$, while $[K_u\cdots K_{n-1}]$ represents the duplicate of block $K_u$ at time $u$ that affects block $K$ at time $n$ along the branch $K_u\to K_{u+1}\to\cdots\to K_{n-1}\to K$ (cf. Figure 4.4 for a simple illustration). The vertex set corresponding to the computation tree is defined as
\[
I = \{[K_u\cdots K_{n-1}]v : [K_u\cdots K_{n-1}]\in T,\ v\in K_u\} \cup \{[\varnothing]v : v\in K\},
\]
and the corresponding state space is given by
\[
\mathbb{S} = \prod_{i\in I} \mathbb{X}^i, \qquad \mathbb{X}^{[t]v} = \mathbb{X}^v \text{ for } [t]v\in I.
\]
It will be convenient in the sequel to introduce some additional notation. First, we will specify the children $c(i)$ of an index $i\in I$ as follows:
\[
c([K_u\cdots K_{n-1}]v) := \{[K_{u-1}\cdots K_{n-1}]v' : K_{u-1}\in N(K_u),\ v'\in N(v)\},
\]
and similarly for $c([\varnothing]v)$. Denote the depth $d(i)$ and location $v(i)$ of $i\in I$ as
\[
d([K_u\cdots K_{n-1}]v) := u, \qquad d([\varnothing]v) := n, \qquad v([t]v) := v.
\]
We define the index set of non-leaf vertices in $I$ as
\[
I_+ := \{i\in I : 0<d(i)\le n\},
\]
and the set of leaves of the tree $T$ as
\[
T_0 := \{[K_0\cdots K_{n-1}] : K_s\in N(K_{s+1}) \text{ for } 0\le s<n\}.
\]
Finally, it will be natural to identify $[t]\in T$ with the corresponding subset of $I$:
\[
[K_u\cdots K_{n-1}] = \{[K_u\cdots K_{n-1}]v : v\in K_u\},
\]
together with the analogous identification for $[\varnothing]$.
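
To make the tree index set concrete, the following sketch enumerates the leaves $T_0$ for a toy configuration. Everything specific here is an assumption made for illustration: the blocks are taken to be `0, ..., num_blocks - 1` arranged on a line, and the hypothetical helper `block_neighbors` plays the role of $N(K')$ with each block interacting only with itself and its adjacent blocks (so $\Delta_{\mathcal{K}}=3$).

```python
from collections import Counter

def block_neighbors(k, num_blocks, reach=1):
    """Hypothetical N(K'): blocks within `reach` of block k on a line of blocks."""
    return [j for j in range(num_blocks) if abs(j - k) <= reach]

def leaves(root, n, num_blocks, reach=1):
    """All branches [K_0, ..., K_{n-1}] with K_s in N(K_{s+1}) and K_n = root."""
    branches = [[]]  # partial branches [K_u, ..., K_{n-1}], grown towards time 0
    for _ in range(n):
        grown = []
        for branch in branches:
            parent = branch[0] if branch else root
            for child in block_neighbors(parent, num_blocks, reach):
                grown.append([child] + branch)
        branches = grown
    return branches

# Toy example: 5 blocks on a line, root block K = 2, time horizon n = 3.
T0 = leaves(root=2, n=3, num_blocks=5)
# Number of leaves that are duplicates of each time-0 block K_0.
alpha = Counter(branch[0] for branch in T0)
print(len(T0), dict(sorted(alpha.items())))
```

In this example the tree has 25 leaves, fewer than the worst-case count $\Delta_{\mathcal{K}}^n = 27$ because the blocks at the ends of the line have only two neighbors. The same time-0 block appears as several distinct leaves, which is the duplication the computation tree is designed to track.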

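We now define the probability measures $\rho,\tilde\rho$ on $\mathbb{S}$ as follows:
\[
\rho(A) = \frac{\int \mathbf{1}_A(x) \prod_{i\in I_+} p^{v(i)}(x^{c(i)},x^i)\, g^{v(i)}(x^i,Y^{v(i)}_{d(i)})\, \psi^{v(i)}(dx^i) \prod_{[t]\in T_0} \mu^{[t]}(dx^{[t]})}{\int \prod_{i\in I_+} p^{v(i)}(x^{c(i)},x^i)\, g^{v(i)}(x^i,Y^{v(i)}_{d(i)})\, \psi^{v(i)}(dx^i) \prod_{[t]\in T_0} \mu^{[t]}(dx^{[t]})},
\]
\[
\tilde\rho(A) = \frac{\int \mathbf{1}_A(x) \prod_{i\in I_+} p^{v(i)}(x^{c(i)},x^i)\, g^{v(i)}(x^i,Y^{v(i)}_{d(i)})\, \psi^{v(i)}(dx^i) \prod_{[t]\in T_0} \nu^{[t]}(dx^{[t]})}{\int \prod_{i\in I_+} p^{v(i)}(x^{c(i)},x^i)\, g^{v(i)}(x^i,Y^{v(i)}_{d(i)})\, \psi^{v(i)}(dx^i) \prod_{[t]\in T_0} \nu^{[t]}(dx^{[t]})},
\]
where we write $\mu^{[K_0\cdots K_{n-1}]} := \mu^{K_0}$ and $\nu^{[K_0\cdots K_{n-1}]} := \nu^{K_0}$ for simplicity. Then, by construction, the measure $\mathsf{B}^K\tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_1\mu$ coincides with the marginal of $\rho$ on the root of the computation tree, while $\mathsf{B}^K\tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_1\nu$ coincides with the marginal of $\tilde\rho$ on the root of the computation tree. In particular, we obtain
\[
\|\tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_1\mu - \tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_1\nu\|_J = \|\rho-\tilde\rho\|_{[\varnothing]J}.
\]
We will use Theorem 2.11 to obtain a bound on this expression.

Throughout the remainder of the proof, we specialize to the case that $\mu=\delta_z$ and $\nu=\delta_{z'}$. To apply Theorem 2.11, we must bound the quantities $C_{ij}$ and $b_i$ with $i=[K_u\cdots K_{n-1}]v$ and $j=[K'_{u'}\cdots K'_{n-1}]v'$. We distinguish three cases.

Case $u=0$. As $\mu=\delta_z$ is nonrandom we evidently have $\rho^i_x = \delta_{z^v}$, so that $C_{ij}=0$. On the other hand, as $\tilde\rho^i_x = \delta_{z'^v}$, we cannot do better than $b_i\le 2$.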

Case $0<u<n$. Now we have
\[
\rho^i_x(A) = \tilde\rho^i_x(A) = \frac{\int \mathbf{1}_A(x^i)\, g^v(x^i,Y_u^v)\, p^v(x^{c(i)},x^i) \prod_{\ell\in I_+:\, i\in c(\ell)} p^{v(\ell)}(x^{c(\ell)},x^\ell)\, \psi^v(dx^i)}{\int g^v(x^i,Y_u^v)\, p^v(x^{c(i)},x^i) \prod_{\ell\in I_+:\, i\in c(\ell)} p^{v(\ell)}(x^{c(\ell)},x^\ell)\, \psi^v(dx^i)}.
\]
Thus $b_i=0$. Moreover, by inspection, $\rho^i_x$ does not depend on $x^j$ except in the following cases: $j\in c(i)$; $i\in c(j)$; $j\in c(\ell)$ for some $\ell\in I_+$ such that $i\in c(\ell)$. As $\mathop{\mathrm{card}} c(i)\le\Delta$ for every $i\in I_+$, we estimate using Lemma A.1
\[
C_{ij} \le
\begin{cases}
1-\varepsilon^2 & \text{if } j\in c(i),\\
1-\varepsilon^2 & \text{if } i\in c(j),\\
1-\varepsilon^{2\Delta} & \text{if } j\in\bigcup_{\ell\in I_+:\, i\in c(\ell)} c(\ell),\\
0 & \text{otherwise.}
\end{cases}
\]
This yields
\[
\sum_{j\in I} e^{\beta|d(i)-d(j)|}\, C_{ij} \le 2(1-\varepsilon^2)\, e^{\beta}\Delta + (1-\varepsilon^{2\Delta})\Delta^2 \le 3(1-\varepsilon^{2\Delta})\, e^{\beta}\Delta^2,
\]
where we have used that $\beta>0$ and $\Delta\ge 1$ in the last inequality.

Case $u=n$. Now $i=[\varnothing]v$, so we have
\[
\rho^i_x(A) = \tilde\rho^i_x(A) = \frac{\int \mathbf{1}_A(x^i)\, g^v(x^i,Y_n^v)\, p^v(x^{c(i)},x^i)\, \psi^v(dx^i)}{\int g^v(x^i,Y_n^v)\, p^v(x^{c(i)},x^i)\, \psi^v(dx^i)}.
\]
Arguing precisely as above, we obtain $b_i=0$ and
\[
\sum_{j\in I} e^{\beta|d(i)-d(j)|}\, C_{ij} \le (1-\varepsilon^2)\, e^{\beta}\Delta.
\]
Combining the above three cases, we obtain
\[
\max_{i\in I} \sum_{j\in I} e^{\beta|d(i)-d(j)|}\, C_{ij} \le 3(1-\varepsilon^{2\Delta})\, e^{\beta}\Delta^2 = \frac12
\]
by the assumption of the Proposition. Thus by Theorem 2.11
\[
\|\tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_1\delta_z - \tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_1\delta_{z'}\|_J = \|\rho-\tilde\rho\|_{[\varnothing]J} \le 4\,\mathop{\mathrm{card}} J\; e^{-\beta n},
\]
where we have used Lemma 2.13 with $m(i,j) = \beta|d(i)-d(j)|$. The proof is completed by extending to general $s<n$ as in the proof of Proposition A.2. □

The proof of Proposition A.13 was simplified by the fact that the resulting bound holds uniformly for all point mass initial conditions (this could be used to obtain a uniform bound for all initial measures along the same lines as the proof of Corollary A.5). To obtain a bound on the variance term, however, we require a more precise stability bound for the block filter that provides explicit control in terms of the initial conditions. We will shortly deduce such a bound from Proposition A.13. Before we can do so, however, we must prove a refinement of Lemma 2.9.
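Lemma A.14. Let $\mu = \mu^1\otimes\cdots\otimes\mu^d$ and $\nu = \nu^1\otimes\cdots\otimes\nu^d$ be product probability measures on $\mathbb{S} = \mathbb{S}^1\times\cdots\times\mathbb{S}^d$, and let $\Lambda:\mathbb{S}\to\mathbb{R}$ be a bounded and strictly positive measurable function. Define the probability measures
\[
\mu_\Lambda(A) := \frac{\int \mathbf{1}_A(x)\Lambda(x)\, \mu(dx)}{\int \Lambda(x)\, \mu(dx)}, \qquad
\nu_\Lambda(A) := \frac{\int \mathbf{1}_A(x)\Lambda(x)\, \nu(dx)}{\int \Lambda(x)\, \nu(dx)}.
\]
Suppose that there exists a constant $\varepsilon>0$ such that the following holds: for every $i=1,\ldots,d$, there is a measurable function $\Lambda^i:\mathbb{S}\to\mathbb{R}$ such that
\[
\varepsilon\Lambda^i(x) \le \Lambda(x) \le \varepsilon^{-1}\Lambda^i(x) \quad\text{for all } x\in\mathbb{S}
\]
and such that $\Lambda^i(x) = \Lambda^i(\tilde x)$ whenever $x^{\{1,\ldots,d\}\setminus\{i\}} = \tilde x^{\{1,\ldots,d\}\setminus\{i\}}$. Then
\[
\|\mu_\Lambda - \nu_\Lambda\| \le \frac{2}{\varepsilon^2} \sum_{i=1}^d \|\mu^i - \nu^i\|.
\]

Proof. Define for $i=0,\ldots,d$ the measures
\[
\rho_i := \nu^1\otimes\cdots\otimes\nu^i\otimes\mu^{i+1}\otimes\cdots\otimes\mu^d, \qquad
\rho_{i,\Lambda}(A) := \frac{\int \mathbf{1}_A(x)\Lambda(x)\, \rho_i(dx)}{\int \Lambda(x)\, \rho_i(dx)}
\]
(by convention, $\rho_0=\mu$ and $\rho_d=\nu$). Then we can estimate
\[
\|\mu_\Lambda - \nu_\Lambda\| \le \sum_{i=1}^d \|\rho_{i,\Lambda} - \rho_{i-1,\Lambda}\|.
\]
Now note that we can estimate for $|f|\le 1$
\[
|\rho_{i,\Lambda}(f) - \rho_{i-1,\Lambda}(f)| \le \frac{1}{\varepsilon\rho_i(\Lambda^i)} \big\{ |\rho_i(f\Lambda) - \rho_{i-1}(f\Lambda)| + |\rho_i(\Lambda) - \rho_{i-1}(\Lambda)| \big\}
\]
as in the proof of Lemma 2.9. Moreover, we can write
\[
|\rho_i(f\Lambda) - \rho_{i-1}(f\Lambda)| = \frac{\rho_i(\Lambda^i)}{\varepsilon} \Big| \int f^i(x^i)\, \nu^i(dx^i) - \int f^i(x^i)\, \mu^i(dx^i) \Big|,
\]
\[
|\rho_i(\Lambda) - \rho_{i-1}(\Lambda)| = \frac{\rho_i(\Lambda^i)}{\varepsilon} \Big| \int g^i(x^i)\, \nu^i(dx^i) - \int g^i(x^i)\, \mu^i(dx^i) \Big|,
\]
where $f^i$ and $g^i$ are functions on $\mathbb{S}^i$ defined by
\[
f^i(x^i) := \frac{\varepsilon}{\rho_i(\Lambda^i)} \int f(x)\Lambda(x)\, \nu^1(dx^1)\cdots\nu^{i-1}(dx^{i-1})\, \mu^{i+1}(dx^{i+1})\cdots\mu^d(dx^d),
\]
\[
g^i(x^i) := \frac{\varepsilon}{\rho_i(\Lambda^i)} \int \Lambda(x)\, \nu^1(dx^1)\cdots\nu^{i-1}(dx^{i-1})\, \mu^{i+1}(dx^{i+1})\cdots\mu^d(dx^d).
\]
Evidently $|f^i|\le 1$ and $|g^i|\le 1$, and the proof follows directly. □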
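
As a sanity check, the inequality of Lemma A.14 can be tested numerically on a small finite state space. The sketch below is illustrative only: the marginals `MU`, `NU` and the weight function `weight` are made-up assumptions, the total variation norm is taken to be $\sup_{|f|\le 1}|\mu(f)-\nu(f)|$ (the sum of absolute differences on a finite space), and it covers only the simplest situation where $\varepsilon\le\Lambda\le\varepsilon^{-1}$, so that one may take $\Lambda^i\equiv 1$.

```python
import math
from itertools import product

def l1_distance(p, q):
    """sup over |f| <= 1 of |p(f) - q(f)| on a finite space (dicts with equal keys)."""
    return sum(abs(p[x] - q[x]) for x in p)

def product_measure(marginals):
    """Product of finite marginals, returned as a dict keyed by coordinate tuples."""
    measure = {}
    for x in product(*(range(len(m)) for m in marginals)):
        measure[x] = math.prod(m[xi] for m, xi in zip(marginals, x))
    return measure

def reweight(measure, weight):
    """The measure mu_Lambda(A) = mu(1_A * Lambda) / mu(Lambda)."""
    norm = sum(weight(x) * w for x, w in measure.items())
    return {x: weight(x) * w / norm for x, w in measure.items()}

def lemma_a14_sides(mu_marginals, nu_marginals, weight, eps):
    """Left- and right-hand sides of the bound in Lemma A.14."""
    mu = reweight(product_measure(mu_marginals), weight)
    nu = reweight(product_measure(nu_marginals), weight)
    lhs = l1_distance(mu, nu)
    rhs = (2 / eps**2) * sum(
        sum(abs(a - b) for a, b in zip(m, n))
        for m, n in zip(mu_marginals, nu_marginals)
    )
    return lhs, rhs

# Hypothetical example with d = 3 binary coordinates.
MU = [[0.5, 0.5], [0.3, 0.7], [0.6, 0.4]]
NU = [[0.4, 0.6], [0.3, 0.7], [0.7, 0.3]]

def weight(x):
    # Takes values in [exp(-0.3), exp(0.3)], so Lambda^i = 1 works with EPS below.
    return math.exp(0.3 * (x[0] * x[1] + x[1] * x[2] - x[0]))

EPS = math.exp(-0.3)
lhs, rhs = lemma_a14_sides(MU, NU, weight, EPS)
print(f"lhs = {lhs:.4f}, rhs = {rhs:.4f}")
```

The point of the lemma, which this toy check does not exercise, is that $\Lambda^i$ may depend on all coordinates other than $x^i$; this is what allows $\varepsilon$ to stay independent of the number of factors $d$.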
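We can now obtain a stability bound with control on the initial conditions.


Proposition A.15. Suppose there exists $\varepsilon>0$ with
\[
\varepsilon \le p^v(x,z^v) \le \varepsilon^{-1} \quad\text{for all } v\in V,\ x,z\in\mathbb{X}
\]
such that
\[
\varepsilon > \varepsilon_0 = \Big(1-\frac{1}{6\Delta^2}\Big)^{1/2\Delta}.
\]
Let $\beta = -\log 6\Delta^2(1-\varepsilon^{2\Delta}) > 0$. Then for any product probability measures
\[
\mu = \bigotimes_{K\in\mathcal{K}} \mu^K, \qquad \nu = \bigotimes_{K\in\mathcal{K}} \nu^K,
\]
we have
\[
\|\tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_{s+1}\mu - \tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_{s+1}\nu\|_J \le \frac{4}{\varepsilon^{2|\mathcal{K}|_\infty}}\, \mathop{\mathrm{card}} J\; e^{-\beta(n-s)} \sum_{K\in\mathcal{K}} \alpha_K\, \|\mu^K - \nu^K\|
\]
for every $s<n$, $K\in\mathcal{K}$, and $J\subseteq K$. Here $(\alpha_K)_{K\in\mathcal{K}}$ are nonnegative integers, depending on $J$ and $n-s$ only, such that $\sum_{K\in\mathcal{K}} \alpha_K \le \Delta_{\mathcal{K}}^{n-s}$.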
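Proof. We fix $s=0$, $n>0$, $K\in\mathcal{K}$, $J\subseteq K$ as in the proof of Proposition A.13, and adopt the notation used there. Define the functions
\[
h_A(x^{T_0}) := \int \mathbf{1}_A(x^{[\varnothing]J}) \prod_{i\in I_+} p^{v(i)}(x^{c(i)},x^i)\, g^{v(i)}(x^i,Y^{v(i)}_{d(i)})\, \psi^{v(i)}(dx^i),
\]
\[
h(x^{T_0}) := \int \prod_{i\in I_+} p^{v(i)}(x^{c(i)},x^i)\, g^{v(i)}(x^i,Y^{v(i)}_{d(i)})\, \psi^{v(i)}(dx^i)
\]
on the leaves $T_0$ of the computation tree, for every measurable $A\subseteq\mathbb{X}^J$. Then
\[
(\mathsf{B}^J\tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_1\mu)(A) = \frac{\int h_A(x^{T_0}) \prod_{[t]\in T_0} \mu^{[t]}(dx^{[t]})}{\int h(x^{T_0}) \prod_{[t]\in T_0} \mu^{[t]}(dx^{[t]})} = \int \frac{h_A(x^{T_0})}{h(x^{T_0})}\, \bar\mu(dx^{T_0}),
\]
where we define the measure
\[
\bar\mu(A) := \frac{\int \mathbf{1}_A(x^{T_0})\, h(x^{T_0}) \prod_{[t]\in T_0} \mu^{[t]}(dx^{[t]})}{\int h(x^{T_0}) \prod_{[t]\in T_0} \mu^{[t]}(dx^{[t]})}.
\]
The measure $\bar\nu$ is defined analogously, and we have
\[
\|\tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_1\mu - \tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_1\nu\|_J = 2 \sup_{A\subseteq\mathbb{X}^J} \Big| \int \frac{h_A}{h}\, d\bar\mu - \int \frac{h_A}{h}\, d\bar\nu \Big|,
\]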
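where the supremum is taken only over measurable sets. But note that $h_A/h$ is precisely the filter obtained when the initial condition is a point mass on the leaves of the computation tree (albeit not with the special duplication pattern induced by the unravelling of the original model; however, this was not used in the proof of Proposition A.13). Therefore, the proof of Proposition A.13 yields
\[
2 \sup_{z,z'\in\mathbb{X}^{T_0}} \sup_{A\subseteq\mathbb{X}^J} \Big| \frac{h_A(z)}{h(z)} - \frac{h_A(z')}{h(z')} \Big| \le 4\,\mathop{\mathrm{card}} J\; e^{-\beta n}.
\]
In particular, using the inequality $|\mu(f)-\nu(f)| \le \frac12 \mathop{\mathrm{osc}} f\, \|\mu-\nu\|$, we obtain
\[
\|\tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_1\mu - \tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_1\nu\|_J \le 2\,\mathop{\mathrm{card}} J\; e^{-\beta n}\, \|\bar\mu-\bar\nu\|.
\]
We now aim to apply Lemma A.14 to estimate $\|\bar\mu-\bar\nu\|$.

To this end, consider a block $[t]\in T_0$. The integrand in the definition of $h(x^{T_0})$ depends on $x^{[t]}$ only through the terms $p^{v(i)}(x^{c(i)},x^i)$ with $c(i)\cap[t]\ne\varnothing$. If we write $[t]=[K_0\cdots K_{n-1}]$, then $c(i)\cap[t]\ne\varnothing$ requires at least $i\in[K_1\cdots K_{n-1}]$ and therefore $\mathop{\mathrm{card}}\{i\in I_+ : c(i)\cap[t]\ne\varnothing\} \le \mathop{\mathrm{card}} K_1 \le |\mathcal{K}|_\infty$. Thus we have
\[
\varepsilon^{|\mathcal{K}|_\infty}\, h^{[t]}(z) \le h(z) \le \varepsilon^{-|\mathcal{K}|_\infty}\, h^{[t]}(z)
\]
for every $z\in\mathbb{X}^{T_0}$ and $[t]\in T_0$, where
\[
h^{[t]}(x^{T_0}) := \int \prod_{i\in I_+:\, c(i)\cap[t]=\varnothing} p^{v(i)}(x^{c(i)},x^i) \prod_{i\in I_+} g^{v(i)}(x^i,Y^{v(i)}_{d(i)})\, \psi^{v(i)}(dx^i)
\]
does not depend on $x^{[t]}$. By Lemma A.14, we obtain
\[
\|\bar\mu-\bar\nu\| \le \frac{2}{\varepsilon^{2|\mathcal{K}|_\infty}} \sum_{[t]\in T_0} \|\mu^{[t]} - \nu^{[t]}\| = \frac{2}{\varepsilon^{2|\mathcal{K}|_\infty}} \sum_{K'\in\mathcal{K}} \alpha_{K'}\, \|\mu^{K'} - \nu^{K'}\|,
\]
where we define $\alpha_{K'} = \mathop{\mathrm{card}}\{[K_0\cdots K_{n-1}]\in T_0 : K_0=K'\}$. As the computation tree has a branching factor of at most $\Delta_{\mathcal{K}}$, we evidently have $\sum_{K\in\mathcal{K}} \alpha_K = \mathop{\mathrm{card}} T_0 \le \Delta_{\mathcal{K}}^n$. The result therefore follows directly for the case $s=0$, and the general case $s<n$ is immediate as in the proof of Proposition A.2. □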

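We finally state the block filter stability bound in its most useful form.


Corollary A.16 (Block filter stability). Suppose there exists $\varepsilon>0$ with
\[
\varepsilon \le p^v(x,z^v) \le \varepsilon^{-1} \quad\text{for all } v\in V,\ x,z\in\mathbb{X}
\]
such that
\[
\varepsilon > \varepsilon_0 = \Big(1-\frac{1}{6\Delta_{\mathcal{K}}\Delta^2}\Big)^{1/2\Delta}.
\]
Let $\beta = -\log 6\Delta_{\mathcal{K}}\Delta^2(1-\varepsilon^{2\Delta}) > 0$. Then for any (possibly random) product probability measures
\[
\mu = \bigotimes_{K\in\mathcal{K}} \mu^K, \qquad \nu = \bigotimes_{K\in\mathcal{K}} \nu^K,
\]
we have
\[
\sqrt{\mathbf{E}\,\|\tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_{s+1}\mu - \tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_{s+1}\nu\|_J^2}
\le \frac{4}{\varepsilon^{2|\mathcal{K}|_\infty}}\, \mathop{\mathrm{card}} J\; e^{-\beta(n-s)} \max_{K\in\mathcal{K}} \sqrt{\mathbf{E}\,\|\mu^K - \nu^K\|^2}
\]
for every $s<n$, $K\in\mathcal{K}$, and $J\subseteq K$.

Proof. The result follows readily from Proposition A.15 (note that we have now absorbed the branching factor in the definition of $\beta$). □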
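
The step that the proof leaves implicit can be spelled out as follows. This is a sketch of the two ingredients: Minkowski's inequality applied to the weighted sum in Proposition A.15, and the change of rate that absorbs $\Delta_{\mathcal{K}}^{n-s}$. Here $\beta'$ is notation introduced only for this display and denotes the rate $-\log 6\Delta^2(1-\varepsilon^{2\Delta})$ of Proposition A.15, while $\beta$ is the rate of Corollary A.16.

```latex
\begin{align*}
\sqrt{\mathbf{E}\Big(\sum_{K\in\mathcal{K}} \alpha_K\, \|\mu^K-\nu^K\|\Big)^{2}}
&\le \sum_{K\in\mathcal{K}} \alpha_K \sqrt{\mathbf{E}\,\|\mu^K-\nu^K\|^2}
 \le \Delta_{\mathcal{K}}^{\,n-s} \max_{K\in\mathcal{K}} \sqrt{\mathbf{E}\,\|\mu^K-\nu^K\|^2}, \\[4pt]
\Delta_{\mathcal{K}}^{\,n-s}\, e^{-\beta'(n-s)}
&= \big(6\,\Delta_{\mathcal{K}}\,\Delta^2\,(1-\varepsilon^{2\Delta})\big)^{n-s}
 = e^{-\beta(n-s)} .
\end{align*}
```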


A.6 Bounding the variance

To complete the proof of Theorem 4.2, it now remains to bound the variance term $|\!|\!|\tilde\pi_n^\mu - \hat\pi_n^\mu|\!|\!|_J$ uniformly in time. This is the goal of the present section. We will first obtain bounds on the one-step error, and then combine these with the block filter stability bound of Corollary A.16 to obtain time-uniform control of the error. The main remaining difficulty is to properly account for the fact that Corollary A.16 is phrased in terms of the total variation norm $\|\cdot\|_J$, which is too strong to control the sampling error (we do not know how to prove an analogous result to Corollary A.16 in the weaker $|\!|\!|\cdot|\!|\!|_J$-norm). To this end, we retain one time step of the block filter dynamics in the one-step error (we control $\|\tilde{\mathsf{F}}_{s+1}\hat{\mathsf{F}}_s\hat\pi_{s-1}^\mu - \tilde{\mathsf{F}}_{s+1}\tilde{\mathsf{F}}_s\hat\pi_{s-1}^\mu\|$ rather than $|\!|\!|\hat{\mathsf{F}}_s\hat\pi_{s-1}^\mu - \tilde{\mathsf{F}}_s\hat\pi_{s-1}^\mu|\!|\!|_J$), which allows us to exploit the fact that the dynamics $\mathsf{P}$ has a density.

Let us begin with the most trivial result: a one-step bound in the $|\!|\!|\cdot|\!|\!|_J$-norm. This estimate will be used to bound the error in the last time step $s=n$.

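Lemma A.17 (Sampling error, $s=n$). Suppose there exists $\kappa>0$ such that
\[
\kappa \le g^v(x^v,y^v) \le \kappa^{-1} \quad\text{for all } v\in V,\ x\in\mathbb{X},\ y\in\mathbb{Y}.
\]
Then
\[
\max_{K\in\mathcal{K}} |\!|\!|\tilde{\mathsf{F}}_n\hat\pi_{n-1}^\mu - \hat{\mathsf{F}}_n\hat\pi_{n-1}^\mu|\!|\!|_K \le \frac{2\kappa^{-2|\mathcal{K}|_\infty}}{\sqrt N}.
\]

Proof. Note that
\[
|\!|\!|\tilde{\mathsf{F}}_n\hat\pi_{n-1}^\mu - \hat{\mathsf{F}}_n\hat\pi_{n-1}^\mu|\!|\!|_K
= |\!|\!|\mathsf{C}_n^K\mathsf{B}^K\mathsf{P}\hat\pi_{n-1}^\mu - \mathsf{C}_n^K\mathsf{B}^K\mathsf{S}^N\mathsf{P}\hat\pi_{n-1}^\mu|\!|\!|
\le 2\kappa^{-2\mathop{\mathrm{card}} K}\, |\!|\!|\mathsf{P}\hat\pi_{n-1}^\mu - \mathsf{S}^N\mathsf{P}\hat\pi_{n-1}^\mu|\!|\!|
\le \frac{2\kappa^{-2\mathop{\mathrm{card}} K}}{\sqrt N},
\]
where the first inequality is Lemma 2.9 and the second inequality follows from the simple estimate $|\!|\!|\mu - \mathsf{S}^N\mu|\!|\!| \le 1/\sqrt N$ that holds for any probability $\mu$. □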
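
The estimate $|\!|\!|\mu - \mathsf{S}^N\mu|\!|\!| \le 1/\sqrt N$ is the usual Monte Carlo rate. The sketch below verifies it exactly, by brute-force enumeration, for one test function on a two-point space. It assumes that the norm is $|\!|\!|\mu-\nu|\!|\!| = \sup_{|f|\le 1}\sqrt{\mathbf{E}|\mu(f)-\nu(f)|^2}$ and that $\mathsf{S}^N\mu$ is the empirical measure of $N$ i.i.d. samples from $\mu$; the measure `mu` and function `f` are made-up examples.

```python
import math
from itertools import product

def rms_sampling_error(mu, f, n):
    """Exact sqrt(E|mu(f) - S^N mu(f)|^2), computed by enumerating all n-samples."""
    points = list(mu)
    mean = sum(mu[x] * f[x] for x in points)
    second_moment = 0.0
    for sample in product(points, repeat=n):
        prob = math.prod(mu[x] for x in sample)       # probability of this i.i.d. sample
        empirical = sum(f[x] for x in sample) / n     # S^N mu(f) for this sample
        second_moment += prob * (empirical - mean) ** 2
    return math.sqrt(second_moment)

# Hypothetical example: two-point space, test function with |f| <= 1.
mu = {0: 0.3, 1: 0.7}
f = {0: -1.0, 1: 1.0}
N = 4
error = rms_sampling_error(mu, f, N)
variance = sum(mu[x] * f[x] ** 2 for x in mu) - sum(mu[x] * f[x] for x in mu) ** 2
print(error, math.sqrt(variance / N), 1 / math.sqrt(N))
```

The enumerated error agrees with $\sqrt{\mathrm{Var}_\mu(f)/N}$, and since $\mathrm{Var}_\mu(f)\le 1$ whenever $|f|\le 1$, the bound $1/\sqrt N$ follows.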

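For the error in steps $s<n$, the requisite one-step bound (Proposition A.20) is more involved. Before we prove it, we must first introduce an elementary lemma about products of empirical measures that will be needed below.

Lemma A.18. For any probability measure $\mu$, we have
\[
|\!|\!|\mu^{\otimes d} - \hat\mu^{\otimes d}|\!|\!| \le \frac{4d}{\sqrt N},
\]
where $\hat\mu = \frac1N \sum_{k=1}^N \delta_{X_k}$ and $X_1,\ldots,X_N$ are i.i.d. $\sim\mu$.

Proof. We assume throughout that $N\ge d^2$ without loss of generality (otherwise the bound is trivial). Let $|f|\le 1$ be a measurable function. Then
\[
\hat\mu^{\otimes d}(f) = \frac{1}{N^d} \sum_{k_1,\ldots,k_d=1}^N f(X_{k_1},\ldots,X_{k_d}).
\]
We begin by bounding
\[
\mathrm{Var}[\hat\mu^{\otimes d}(f)] = \frac{1}{N^{2d}} \sum_{k_1,\ldots,k_d=1}^N\ \sum_{k_1',\ldots,k_d'=1}^N \mathbf{E}\big[F_{k_1,\ldots,k_d}\, F_{k_1',\ldots,k_d'}\big],
\]
where
\[
F_{k_1,\ldots,k_d} := f(X_{k_1},\ldots,X_{k_d}) - \mathbf{E} f(X_{k_1},\ldots,X_{k_d}).
\]
Note that $\mathbf{E}[F_{k_1,\ldots,k_d}F_{k_1',\ldots,k_d'}] = 0$ when $\{k_1,\ldots,k_d\}\cap\{k_1',\ldots,k_d'\} = \varnothing$. Thus
\[
\mathrm{Var}[\hat\mu^{\otimes d}(f)] \le \frac{4}{N^{2d}} \sum_{k_1,\ldots,k_d=1}^N\ \sum_{k_1',\ldots,k_d'=1}^N \mathbf{1}_{\{k_1,\ldots,k_d\}\cap\{k_1',\ldots,k_d'\}\ne\varnothing},
\]
where we use $|F_{k_1,\ldots,k_d}|\le 2$. But for each choice of $k_1,\ldots,k_d$, there are at least $(N-d)^d$ choices of $k_1',\ldots,k_d'$ such that $\{k_1,\ldots,k_d\}\cap\{k_1',\ldots,k_d'\} = \varnothing$, so
\[
\mathrm{Var}[\hat\mu^{\otimes d}(f)] \le \frac{4N^d\{N^d-(N-d)^d\}}{N^{2d}} = 4\Big\{1-\Big(1-\frac dN\Big)^d\Big\} \le \frac{4d^2}{N}.
\]
We can therefore estimate
\[
|\!|\!|\mu^{\otimes d} - \hat\mu^{\otimes d}|\!|\!| \le \|\mu^{\otimes d} - \mathbf{E}\hat\mu^{\otimes d}\| + |\!|\!|\mathbf{E}\hat\mu^{\otimes d} - \hat\mu^{\otimes d}|\!|\!|
\le \|\mu^{\otimes d} - \mathbf{E}\hat\mu^{\otimes d}\| + \frac{2d}{\sqrt N}.
\]
It remains to estimate the first term. To this end, note that $\mathbf{E} f(X_{k_1},\ldots,X_{k_d}) = \mu^{\otimes d}(f)$ whenever $k_1\ne\cdots\ne k_d$ (that is, whenever the indices are all distinct). Therefore, we evidently have
\[
|\mathbf{E}\hat\mu^{\otimes d}(f) - \mu^{\otimes d}(f)| \le \frac{2}{N^d} \sum_{k_1,\ldots,k_d=1}^N \mathbf{1}_{k_1,\ldots,k_d \text{ not all distinct}}
= 2\Big\{1 - \frac{1}{N^d}\frac{N!}{(N-d)!}\Big\}
\le 2\Big\{1-\Big(1-\frac dN\Big)^d\Big\} \le \frac{2d^2}{N}.
\]
But as $N\ge d^2$, we have $d^2/N \le d/\sqrt N$. The result follows. □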
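
The combinatorial chain used twice in the proof above, $1-\frac{N!}{N^d(N-d)!} \le 1-(1-d/N)^d \le d^2/N$, can be checked in exact rational arithmetic. The following sketch does so for small $N$ and $d$; the function names are illustrative, and `math.perm` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import perm

def collision_fraction(n, d):
    """Fraction of index tuples (k_1, ..., k_d) in {1..n}^d with a repeated index."""
    return 1 - Fraction(perm(n, d), n ** d)

def chain_holds(n, d):
    """Check 1 - n!/(n^d (n-d)!) <= 1 - (1 - d/n)^d <= d^2/n exactly."""
    exact = collision_fraction(n, d)
    middle = 1 - (1 - Fraction(d, n)) ** d
    return exact <= middle <= Fraction(d * d, n)

# Exhaustive check over a small range of sample sizes and dimensions.
all_ok = all(chain_holds(n, d) for n in range(1, 40) for d in range(1, n + 1))
print(all_ok, collision_fraction(4, 2))
```

The first inequality holds because $N(N-1)\cdots(N-d+1)\ge(N-d)^d$, and the second is Bernoulli's inequality $(1-x)^d\ge 1-dx$ with $x=d/N$.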
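This result will be used in the following form.

Corollary A.19. For any subset of blocks $\mathcal{L}\subseteq\mathcal{K}$, we have
\[
|\!|\!|\textstyle\bigotimes_{K\in\mathcal{L}} \mathsf{B}^K\mu - \bigotimes_{K\in\mathcal{L}} \mathsf{B}^K\mathsf{S}^N\mu|\!|\!| \le \frac{4\mathop{\mathrm{card}}\mathcal{L}}{\sqrt N}
\]
for every probability measure $\mu$ on $\mathbb{X}$ and $s\ge 1$.

Proof. Write $\hat\mu := \mathsf{S}^N\mu$ and $d = \mathop{\mathrm{card}}\mathcal{L}$, and let us enumerate the blocks $\mathcal{L} = \{K_1,\ldots,K_d\}$. Then for any bounded function $f:\mathbb{X}^{\bigcup\mathcal{L}}\to\mathbb{R}$, we can write
\[
\big(\textstyle\bigotimes_{K\in\mathcal{L}} \mathsf{B}^K\mu\big)(f) = \int f(x_1^{K_1},\ldots,x_d^{K_d})\, \mu(dx_1)\cdots\mu(dx_d),
\]
\[
\big(\textstyle\bigotimes_{K\in\mathcal{L}} \mathsf{B}^K\mathsf{S}^N\mu\big)(f) = \int f(x_1^{K_1},\ldots,x_d^{K_d})\, \hat\mu(dx_1)\cdots\hat\mu(dx_d).
\]
Thus evidently
\[
|\!|\!|\textstyle\bigotimes_{K\in\mathcal{L}} \mathsf{B}^K\mu - \bigotimes_{K\in\mathcal{L}} \mathsf{B}^K\mathsf{S}^N\mu|\!|\!| \le |\!|\!|\mu^{\otimes d} - \hat\mu^{\otimes d}|\!|\!|,
\]
and the result follows from Lemma A.18. □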


We  now  proceed  to  prove  a  one-step  error  bound  for  time  steps  s  <  n. 
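Proposition A.20 (Sampling error, $s < n$). Suppose there exist $\varepsilon, \kappa > 0$ with

$$\varepsilon \le p^v(x,z^v) \le \varepsilon^{-1},\qquad \kappa \le g^v(x^v,y^v) \le \kappa^{-1}\qquad \forall\, v\in V,\ x,z\in\mathbb{X},\ y\in\mathbb{Y}.$$

Then

$$\max_{K\in\mathcal{K}}\sqrt{\mathbf{E}\,\|\tilde{\mathsf{F}}_{s+1}\tilde{\mathsf{F}}_s\hat\pi^\mu_{s-1} - \tilde{\mathsf{F}}_{s+1}\hat{\mathsf{F}}_s\hat\pi^\mu_{s-1}\|_K^2} \le \frac{16\,\Delta_{\mathcal{K}}\,\varepsilon^{-2|\mathcal{K}|_\infty}\kappa^{-4|\mathcal{K}|_\infty\Delta_{\mathcal{K}}}}{\sqrt{N}}$$

for every $0 < s < n$.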

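Proof. We begin by bounding using Lemma 2.9

$$\|\tilde{\mathsf{F}}_{s+1}\tilde{\mathsf{F}}_s\hat\pi^\mu_{s-1} - \tilde{\mathsf{F}}_{s+1}\hat{\mathsf{F}}_s\hat\pi^\mu_{s-1}\|_K = \|\mathsf{C}^K_{s+1}\mathsf{B}^K\mathsf{P}\tilde{\mathsf{F}}_s\hat\pi^\mu_{s-1} - \mathsf{C}^K_{s+1}\mathsf{B}^K\mathsf{P}\hat{\mathsf{F}}_s\hat\pi^\mu_{s-1}\| \le 2\kappa^{-2|\mathcal{K}|_\infty}\|\mathsf{B}^K\mathsf{P}\tilde{\mathsf{F}}_s\hat\pi^\mu_{s-1} - \mathsf{B}^K\mathsf{P}\hat{\mathsf{F}}_s\hat\pi^\mu_{s-1}\|.$$

Now note that

$$\frac{(\mathsf{B}^K\mathsf{P}\tilde{\mathsf{F}}_s\hat\pi^\mu_{s-1})(dx^K)}{\psi^K(dx^K)} = \frac{\int\prod_{v\in K}p^v(z,x^v)\prod_{K'\in N(K)}\prod_{v'\in K'}g^{v'}(z^{v'},Y_s^{v'})\,\big(\bigotimes_{K'\in N(K)}\mathsf{B}^{K'}\mathsf{P}\hat\pi^\mu_{s-1}\big)(dz)}{\int\prod_{K'\in N(K)}\prod_{v'\in K'}g^{v'}(z^{v'},Y_s^{v'})\,\big(\bigotimes_{K'\in N(K)}\mathsf{B}^{K'}\mathsf{P}\hat\pi^\mu_{s-1}\big)(dz)},$$

$$\frac{(\mathsf{B}^K\mathsf{P}\hat{\mathsf{F}}_s\hat\pi^\mu_{s-1})(dx^K)}{\psi^K(dx^K)} = \frac{\int\prod_{v\in K}p^v(z,x^v)\prod_{K'\in N(K)}\prod_{v'\in K'}g^{v'}(z^{v'},Y_s^{v'})\,\big(\bigotimes_{K'\in N(K)}\mathsf{B}^{K'}\mathsf{S}^N\mathsf{P}\hat\pi^\mu_{s-1}\big)(dz)}{\int\prod_{K'\in N(K)}\prod_{v'\in K'}g^{v'}(z^{v'},Y_s^{v'})\,\big(\bigotimes_{K'\in N(K)}\mathsf{B}^{K'}\mathsf{S}^N\mathsf{P}\hat\pi^\mu_{s-1}\big)(dz)},$$

where $\psi^K(dx^K) := \bigotimes_{v\in K}\psi^v(dx^v)$, and we can write

$$\|\mathsf{B}^K\mathsf{P}\tilde{\mathsf{F}}_s\hat\pi^\mu_{s-1} - \mathsf{B}^K\mathsf{P}\hat{\mathsf{F}}_s\hat\pi^\mu_{s-1}\| = \int\bigg|\frac{(\mathsf{B}^K\mathsf{P}\tilde{\mathsf{F}}_s\hat\pi^\mu_{s-1})(dx^K)}{\psi^K(dx^K)} - \frac{(\mathsf{B}^K\mathsf{P}\hat{\mathsf{F}}_s\hat\pi^\mu_{s-1})(dx^K)}{\psi^K(dx^K)}\bigg|\,\psi^K(dx^K).$$

We therefore have by Minkowski's integral inequality

$$\sqrt{\mathbf{E}\,\|\mathsf{B}^K\mathsf{P}\tilde{\mathsf{F}}_s\hat\pi^\mu_{s-1} - \mathsf{B}^K\mathsf{P}\hat{\mathsf{F}}_s\hat\pi^\mu_{s-1}\|^2} \le \int\sqrt{\mathbf{E}\,\bigg|\frac{(\mathsf{B}^K\mathsf{P}\tilde{\mathsf{F}}_s\hat\pi^\mu_{s-1})(dx^K)}{\psi^K(dx^K)} - \frac{(\mathsf{B}^K\mathsf{P}\hat{\mathsf{F}}_s\hat\pi^\mu_{s-1})(dx^K)}{\psi^K(dx^K)}\bigg|^2}\ \psi^K(dx^K)$$

$$\le \psi^K(\mathbb{X}^K)\sup_{x^K\in\mathbb{X}^K}\sqrt{\mathbf{E}\,\bigg|\frac{(\mathsf{B}^K\mathsf{P}\tilde{\mathsf{F}}_s\hat\pi^\mu_{s-1})(dx^K)}{\psi^K(dx^K)} - \frac{(\mathsf{B}^K\mathsf{P}\hat{\mathsf{F}}_s\hat\pi^\mu_{s-1})(dx^K)}{\psi^K(dx^K)}\bigg|^2}.$$

As we have

$$\varepsilon\,\psi^v(\mathbb{X}^v) \le \int p^v(x,z^v)\,\psi^v(dz^v) = 1,\qquad \prod_{v\in K}p^v(z,x^v) \le \varepsilon^{-|\mathcal{K}|_\infty}$$

and

$$\kappa^{|\mathcal{K}|_\infty\Delta_{\mathcal{K}}} \le \prod_{K'\in N(K)}\prod_{v'\in K'}g^{v'}(z^{v'},Y_s^{v'}) \le \kappa^{-|\mathcal{K}|_\infty\Delta_{\mathcal{K}}},$$

we can apply Lemma 2.9 to estimate

$$\sqrt{\mathbf{E}\,\|\mathsf{B}^K\mathsf{P}\tilde{\mathsf{F}}_s\hat\pi^\mu_{s-1} - \mathsf{B}^K\mathsf{P}\hat{\mathsf{F}}_s\hat\pi^\mu_{s-1}\|^2} \le 2\,\varepsilon^{-2|\mathcal{K}|_\infty}\kappa^{-2|\mathcal{K}|_\infty\Delta_{\mathcal{K}}}\,\Big|\Big|\Big|\bigotimes_{K'\in N(K)}\mathsf{B}^{K'}\mathsf{P}\hat\pi^\mu_{s-1} - \bigotimes_{K'\in N(K)}\mathsf{B}^{K'}\mathsf{S}^N\mathsf{P}\hat\pi^\mu_{s-1}\Big|\Big|\Big|.$$

By Corollary A.19 (applied conditionally given $\hat\pi^\mu_{s-1}$), we obtain

$$\sqrt{\mathbf{E}\,\|\mathsf{B}^K\mathsf{P}\tilde{\mathsf{F}}_s\hat\pi^\mu_{s-1} - \mathsf{B}^K\mathsf{P}\hat{\mathsf{F}}_s\hat\pi^\mu_{s-1}\|^2} \le \frac{8\,\Delta_{\mathcal{K}}\,\varepsilon^{-2|\mathcal{K}|_\infty}\kappa^{-2|\mathcal{K}|_\infty\Delta_{\mathcal{K}}}}{\sqrt{N}}.$$

The result follows immediately. □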

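We finally put everything together.

Theorem A.21 (Variance term). Suppose there exist $\varepsilon, \kappa > 0$ with

$$\varepsilon \le p^v(x,z^v) \le \varepsilon^{-1},\qquad \kappa \le g^v(x^v,y^v) \le \kappa^{-1}\qquad \forall\, v\in V,\ x,z\in\mathbb{X},\ y\in\mathbb{Y}$$

such that

$$\varepsilon > \varepsilon_0 = \Big(1 - \frac{1}{6\Delta_{\mathcal{K}}\Delta^2}\Big)^{1/2\Delta}.$$

Let $\beta = -\log 6\Delta_{\mathcal{K}}\Delta^2(1-\varepsilon^{2\Delta}) > 0$. Then

$$|||\tilde\pi^x_n - \hat\pi^x_n|||_J \le \mathrm{card}\,J\ \frac{64\,\Delta_{\mathcal{K}}\,e^{\beta}}{1-e^{-\beta}}\ \frac{\varepsilon^{-4|\mathcal{K}|_\infty}\kappa^{-4|\mathcal{K}|_\infty\Delta_{\mathcal{K}}}}{\sqrt{N}}$$

for every $n \ge 0$, $x\in\mathbb{X}$, $K\in\mathcal{K}$ and $J\subseteq K$.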
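Proof. We begin with the elementary error decomposition

$$|||\tilde\pi^x_n - \hat\pi^x_n|||_J \le \sum_{s=1}^n |||\tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_{s+1}\tilde{\mathsf{F}}_s\hat\pi^x_{s-1} - \tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_{s+1}\hat{\mathsf{F}}_s\hat\pi^x_{s-1}|||_J.$$

The term $s = n$ in this sum is bounded in Lemma A.17:

$$|||\tilde{\mathsf{F}}_n\hat\pi^x_{n-1} - \hat{\mathsf{F}}_n\hat\pi^x_{n-1}|||_J \le \frac{2\kappa^{-2|\mathcal{K}|_\infty}}{\sqrt{N}}.$$

The term $s = n-1$ is bounded in Proposition A.20:

$$|||\tilde{\mathsf{F}}_n\tilde{\mathsf{F}}_{n-1}\hat\pi^x_{n-2} - \tilde{\mathsf{F}}_n\hat{\mathsf{F}}_{n-1}\hat\pi^x_{n-2}|||_J \le \frac{16\,\Delta_{\mathcal{K}}\,\varepsilon^{-2|\mathcal{K}|_\infty}\kappa^{-4|\mathcal{K}|_\infty\Delta_{\mathcal{K}}}}{\sqrt{N}}.$$

Now suppose $s < n-1$. Then we can estimate using Corollary A.16

$$|||\tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_{s+1}\tilde{\mathsf{F}}_s\hat\pi^x_{s-1} - \tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_{s+1}\hat{\mathsf{F}}_s\hat\pi^x_{s-1}|||_J \le \frac{4}{\varepsilon^{2|\mathcal{K}|_\infty}}\,\mathrm{card}\,J\ e^{-\beta(n-s-1)}\max_{K\in\mathcal{K}}\sqrt{\mathbf{E}\,\|\tilde{\mathsf{F}}_{s+1}\tilde{\mathsf{F}}_s\hat\pi^x_{s-1} - \tilde{\mathsf{F}}_{s+1}\hat{\mathsf{F}}_s\hat\pi^x_{s-1}\|_K^2}.$$

Applying Proposition A.20 yields

$$|||\tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_{s+1}\tilde{\mathsf{F}}_s\hat\pi^x_{s-1} - \tilde{\mathsf{F}}_n\cdots\tilde{\mathsf{F}}_{s+1}\hat{\mathsf{F}}_s\hat\pi^x_{s-1}|||_J \le \mathrm{card}\,J\ e^{-\beta(n-s-1)}\ \frac{64\,\Delta_{\mathcal{K}}\,\varepsilon^{-4|\mathcal{K}|_\infty}\kappa^{-4|\mathcal{K}|_\infty\Delta_{\mathcal{K}}}}{\sqrt{N}}.$$

Substituting the above three cases into the error decomposition and summing the geometric series yields the statement of the Theorem. □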

Theorems  A.  12  and  A. 21  now  immediately  yield  Theorem  4.2. 
Appendix B

Localized Gibbs sampler particle filter: proofs


The goal of this section is to prove Theorem 5.4. What follows directly builds on the discussion presented in Section 5.6.

The key idea to bound $\|\mathsf{F}_n\nu - \tilde{\mathsf{F}}_n\nu\|_J$ is to use the one-sided Dobrushin comparison theorem (Theorem 2.12) to capture the one-sidedness that is embedded in the Gibbs samplers $\mathsf{F}_n\nu$ and $\tilde{\mathsf{F}}_n\nu$. To this end, we need to bound the one-sided coefficients $C_{ij}$'s and $b_j$'s. This will be achieved, respectively, using Proposition B.1 and Proposition B.2 below; the proofs of these two propositions are based on the original Dobrushin comparison theorem (Theorem 2.11).


B.1 Preliminary steps with Dobrushin comparison theorem

The following proposition will be used to bound the $C_{ij}$'s coefficients in the one-sided comparison theorem.

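Proposition B.1. Suppose there exists $\varepsilon > 0$ such that

$$\varepsilon \le p^v(x,z^v) \le \varepsilon^{-1}\qquad\text{for all } v\in V,\ x,z\in\mathbb{X}.$$

Let $\nu$ be a probability measure on $\mathbb{X}$, and suppose that

$$\mathrm{Corr}(\nu,\beta) + (1-\varepsilon^2)\,e^{\beta(r+1)}\Delta \le c < 1$$

for a sufficiently small constant $\beta > 0$. Fix $n \ge 1$, and write $\eta^v$ for $\eta^v_{n,\nu}$. For each $v, v'\in V$ define

$$R_{vv'} := \frac{1}{2}\sup_{x,\tilde x\in\mathbb{X}:\ x^{V\setminus\{v'\}} = \tilde x^{V\setminus\{v'\}}}\|\eta^v_x - \eta^v_{\tilde x}\|.$$

Then,

$$\max_{v\in V}\sum_{v'\in V}e^{\beta d(v,v')}R_{vv'} \le \frac{(1-\varepsilon^2)\,\Delta\,e^{\beta(r-1)}}{1-c}.$$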


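Proof. Henceforth, fix $n \ge 1$, $v, v'\in V$ such that $v \ne v'$ and $x,\tilde x\in\mathbb{X}$ such that $x^{V\setminus\{v'\}} = \tilde x^{V\setminus\{v'\}}$. For simplicity, write $\eta^v$ for $\eta^v_{n,\nu}$. Define $I = (\{0\}\times V)\cup(1,v)$ and $\mathbb{S} = \mathbb{X}\times\mathbb{X}^v$, and the probability measures on $\mathbb{S}$

$$\rho(A) := \frac{\int 1_A(z,\omega)\,g^v(\omega,Y^v_n)\prod_{w\in V\setminus\{v\}}p^w(z,x^w)\,p^v(z,\omega)\,\nu(dz)\,\psi^v(d\omega)}{\int g^v(\omega,Y^v_n)\prod_{w\in V\setminus\{v\}}p^w(z,x^w)\,p^v(z,\omega)\,\nu(dz)\,\psi^v(d\omega)},$$

$$\tilde\rho(A) := \frac{\int 1_A(z,\omega)\,g^v(\omega,Y^v_n)\prod_{w\in V\setminus\{v\}}p^w(z,\tilde x^w)\,p^v(z,\omega)\,\nu(dz)\,\psi^v(d\omega)}{\int g^v(\omega,Y^v_n)\prod_{w\in V\setminus\{v\}}p^w(z,\tilde x^w)\,p^v(z,\omega)\,\nu(dz)\,\psi^v(d\omega)}.$$

By construction, for any bounded measurable function $f$ on $\mathbb{X}^v$ we have

$$|\eta^v_x f - \eta^v_{\tilde x} f| = \bigg|\int\rho(dz,d\omega)\,f(\omega) - \int\tilde\rho(dz,d\omega)\,f(\omega)\bigg|,$$

and we will now proceed by applying the Dobrushin comparison theorem (Theorem 2.11) to bound this quantity. To this end, we must bound $C_{ij}$ and $b_i$ with $i = (k'',v'')$ and $j = (k''',v''')$. We distinguish two cases.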

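Case $k'' = 0$. In this case we have

$$\rho^i_{(z,\omega)}(A) = \frac{\int 1_A(z^{v''})\prod_{w\in N(v'')\setminus\{v\}}p^w(z,x^w)\,p^v(z,\omega)\,\nu^{v''}_z(dz^{v''})}{\int\prod_{w\in N(v'')\setminus\{v\}}p^w(z,x^w)\,p^v(z,\omega)\,\nu^{v''}_z(dz^{v''})},$$

$$\tilde\rho^i_{(z,\omega)}(A) = \frac{\int 1_A(z^{v''})\prod_{w\in N(v'')\setminus\{v\}}p^w(z,\tilde x^w)\,p^v(z,\omega)\,\nu^{v''}_z(dz^{v''})}{\int\prod_{w\in N(v'')\setminus\{v\}}p^w(z,\tilde x^w)\,p^v(z,\omega)\,\nu^{v''}_z(dz^{v''})}.$$

In particular, $\rho^i_{(z,x^v)} = \nu^{v''}_{z,x}$, so $C_{ij} \le C^\nu_{v''v'''}$ if $k''' = 0$. Moreover, as

$$\rho^i_{(z,\omega)}(A) \ge \varepsilon^2\,\frac{\int 1_A(z^{v''})\prod_{w\in N(v'')\setminus\{v\}}p^w(z,x^w)\,\nu^{v''}_z(dz^{v''})}{\int\prod_{w\in N(v'')\setminus\{v\}}p^w(z,x^w)\,\nu^{v''}_z(dz^{v''})},$$

we have $C_{ij} \le 1-\varepsilon^2$ if $k''' = 1$ (so $v''' = v$) and $v\in N(v'')$ by Lemma A.1, and $C_{ij} = 0$ otherwise. We therefore immediately obtain the estimate

$$\sum_{(k''',v''')\in I}e^{\beta|k'''|}e^{\beta d(v'',v''')}C_{(0,v'')(k''',v''')} \le \mathrm{Corr}(\nu,\beta) + (1-\varepsilon^2)\,e^{\beta(r+1)}.$$

On the other hand, note that $\rho^i_{(z,\omega)} = \tilde\rho^i_{(z,\omega)}$ if $v''\notin N(v')$, and that we have $\rho^i_{(z,\omega)}(A) \ge \varepsilon^2\chi(A)$ and $\tilde\rho^i_{(z,\omega)}(A) \ge \varepsilon^2\chi(A)$, where

$$\chi(A) := \frac{\int 1_A(z^{v''})\prod_{w\in N(v'')\setminus\{v,v'\}}p^w(z,x^w)\,p^v(z,\omega)\,\nu^{v''}_z(dz^{v''})}{\int\prod_{w\in N(v'')\setminus\{v,v'\}}p^w(z,x^w)\,p^v(z,\omega)\,\nu^{v''}_z(dz^{v''})}.$$

Therefore, by Lemma A.1

$$b_i = \sup_{(z,\omega)\in\mathbb{S}}\|\rho^i_{(z,\omega)} - \tilde\rho^i_{(z,\omega)}\| \le \begin{cases} 0 & \text{for } v''\notin N(v'),\\ 2(1-\varepsilon^2) & \text{otherwise.}\end{cases}$$

Case $k'' = 1$. In this case we have

$$\rho^i_{(z,\omega)}(A) = \tilde\rho^i_{(z,\omega)}(A) = \frac{\int 1_A(\omega)\,g^v(\omega,Y^v_n)\,p^v(z,\omega)\,\psi^v(d\omega)}{\int g^v(\omega,Y^v_n)\,p^v(z,\omega)\,\psi^v(d\omega)}.$$

Thus $b_i = 0$, and estimating as above we obtain $C_{ij} \le 1-\varepsilon^2$ whenever $k''' = 0$ and $v'''\in N(v)$, and $C_{ij} = 0$ otherwise. In particular, we obtain

$$\sum_{(k''',v''')\in I}e^{\beta|1-k'''|}e^{\beta d(v,v''')}C_{(1,v)(k''',v''')} \le (1-\varepsilon^2)\,e^{\beta(r+1)}\Delta.$$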

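Combining the above two cases and the assumption of the Proposition yields

$$\max_{(k'',v'')\in I}\sum_{(k''',v''')\in I}e^{\beta|k''-k'''|}e^{\beta d(v'',v''')}C_{(k'',v'')(k''',v''')} \le c.$$

Applying Theorem 2.11 gives

$$\|\eta^v_x - \eta^v_{\tilde x}\| = \sup_{f\in\mathcal{X}^v:\,|f|\le 1}\bigg|\int\rho(dz,d\omega)\,f(\omega) - \int\tilde\rho(dz,d\omega)\,f(\omega)\bigg| \le 2(1-\varepsilon^2)\sum_{v''\in N(v')}D_{(1,v)(0,v'')}.$$

As the previous bound does not depend on the choice of $x,\tilde x\in\mathbb{X}$, as long as $x^{V\setminus\{v'\}} = \tilde x^{V\setminus\{v'\}}$, we have

$$R_{vv'} = \frac{1}{2}\sup_{x,\tilde x\in\mathbb{X}:\ x^{V\setminus\{v'\}} = \tilde x^{V\setminus\{v'\}}}\|\eta^v_x - \eta^v_{\tilde x}\| \le (1-\varepsilon^2)\sum_{v''\in N(v')}D_{(1,v)(0,v'')}.$$

By Lemma 2.13 it follows that

$$\max_{v\in V}\sum_{v'\in V}e^{\beta d(v,v')}R_{vv'} \le (1-\varepsilon^2)\max_{v\in V}\sum_{v'\in V}e^{\beta d(v,v')}\sum_{v''\in N(v')}D_{(1,v)(0,v'')}$$

$$\le (1-\varepsilon^2)\max_{v\in V}\sum_{v''\in V}e^{\beta d(v,v'')+\beta r}D_{(1,v)(0,v'')}\sum_{v'\in V}1_{v''\in N(v')}$$

$$\le (1-\varepsilon^2)\,\Delta\,e^{\beta(r-1)}\max_{v\in V}\sum_{v''\in V}e^{\beta d(v,v'')+\beta}D_{(1,v)(0,v'')}$$

$$\le \frac{(1-\varepsilon^2)\,\Delta\,e^{\beta(r-1)}}{1-c}$$

and, in particular,

$$\max_{v\in V}\sum_{v'\in V}R_{vv'} \le \frac{(1-\varepsilon^2)\,\Delta\,e^{\beta(r-1)}}{1-c}.$$

□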
The following proposition will be used to bound the $b_j$'s coefficients in the one-sided comparison theorem.

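Proposition B.2. Suppose there exists $\varepsilon > 0$ such that

$$\varepsilon \le p^v(x,z^v) \le \varepsilon^{-1}\qquad\text{for all } v\in V,\ x,z\in\mathbb{X}.$$

Let $\nu$ be a probability measure on $\mathbb{X}$, and suppose that

$$\mathrm{Corr}(\nu,\beta) + (1-\varepsilon^2)\,e^{\beta(r+1)}\Delta \le c < 1$$

for a sufficiently small constant $\beta > 0$. Fix $n \ge 1$, and write $\eta^v$ for $\eta^v_{n,\nu}$. Then, for each $v\in V$ we have

$$\sup_{x\in\mathbb{X}}\|\eta^v_x - \tilde\eta^v_x\| \le 2\,\frac{(1-\varepsilon^{2\Delta})\,e^{-\beta(2-r)}}{1-c}\,e^{-\beta b}.$$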


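Proof. Henceforth, fix $n \ge 1$, $v\in V$, $x\in\mathbb{X}$. For simplicity, write $\eta^v$ for $\eta^v_{n,\nu}$ and $\tilde\eta^v$ for $\tilde\eta^v_{n,\nu}$. Define $I = (\{0\}\times V)\cup(1,v)$ and $\mathbb{S} = \mathbb{X}\times\mathbb{X}^v$, and the probability measures on $\mathbb{S}$

$$\rho(A) := \frac{\int 1_A(z,\omega)\,g^v(\omega,Y^v_n)\prod_{w\in V\setminus\{v\}}p^w(z,x^w)\,p^v(z,\omega)\,\nu(dz)\,\psi^v(d\omega)}{\int g^v(\omega,Y^v_n)\prod_{w\in V\setminus\{v\}}p^w(z,x^w)\,p^v(z,\omega)\,\nu(dz)\,\psi^v(d\omega)},$$

$$\tilde\rho(A) := \frac{\int 1_A(z,\omega)\,g^v(\omega,Y^v_n)\prod_{w\in N_b(v)\setminus\{v\}}p^w(z,x^w)\,p^v(z,\omega)\,\nu(dz)\,\psi^v(d\omega)}{\int g^v(\omega,Y^v_n)\prod_{w\in N_b(v)\setminus\{v\}}p^w(z,x^w)\,p^v(z,\omega)\,\nu(dz)\,\psi^v(d\omega)}.$$

By construction, for any bounded measurable function $f$ on $\mathbb{X}^v$ we have

$$|\eta^v_x f - \tilde\eta^v_x f| = \bigg|\int\rho(dz,d\omega)\,f(\omega) - \int\tilde\rho(dz,d\omega)\,f(\omega)\bigg|,$$

and we now proceed by applying the Dobrushin comparison theorem (Theorem 2.11) to bound this quantity. To this end, we must bound $C_{ij}$ and $b_i$ with $i = (k',v')$ and $j = (k'',v'')$. We distinguish two cases.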

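Case $k' = 0$. In this case we have

$$\rho^i_{(z,x^v)}(A) = \frac{\int 1_A(z^{v'})\prod_{w\in N(v')}p^w(z,x^w)\,\nu^{v'}_z(dz^{v'})}{\int\prod_{w\in N(v')}p^w(z,x^w)\,\nu^{v'}_z(dz^{v'})},$$

$$\tilde\rho^i_{(z,x^v)}(A) = \frac{\int 1_A(z^{v'})\prod_{w\in N(v')\cap N_b(v)}p^w(z,x^w)\,\nu^{v'}_z(dz^{v'})}{\int\prod_{w\in N(v')\cap N_b(v)}p^w(z,x^w)\,\nu^{v'}_z(dz^{v'})}.$$

In particular, $\rho^i_{(z,x^v)} = \nu^{v'}_{z,x}$, so $C_{ij} \le C^\nu_{v'v''}$ if $k'' = 0$. Moreover, as

$$\rho^i_{(z,x^v)}(A) \ge \varepsilon^2\,\frac{\int 1_A(z^{v'})\prod_{w\in N(v')\setminus\{v\}}p^w(z,x^w)\,\nu^{v'}_z(dz^{v'})}{\int\prod_{w\in N(v')\setminus\{v\}}p^w(z,x^w)\,\nu^{v'}_z(dz^{v'})},$$

we have $C_{ij} \le 1-\varepsilon^2$ if $k'' = 1$ (so $v'' = v$) and $v\in N(v')$ by Lemma A.1, and $C_{ij} = 0$ otherwise. We therefore immediately obtain the estimate

$$\sum_{(k'',v'')\in I}e^{\beta|k''|}e^{\beta d(v',v'')}C_{(0,v')(k'',v'')} \le \mathrm{Corr}(\nu,\beta) + (1-\varepsilon^2)\,e^{\beta(r+1)}.$$

On the other hand, note that $\rho^i_{(z,x^v)} = \tilde\rho^i_{(z,x^v)}$ if $N(v')\subseteq N_b(v)$, and that we have $\rho^i_{(z,x^v)} \ge \varepsilon^{2\Delta}\nu^{v'}_z$ and $\tilde\rho^i_{(z,x^v)} \ge \varepsilon^{2\Delta}\nu^{v'}_z$. Therefore, by Lemma A.1

$$b_i = \sup_{(z,x^v)\in\mathbb{S}}\|\rho^i_{(z,x^v)} - \tilde\rho^i_{(z,x^v)}\| \le \begin{cases} 0 & \text{for } v'\in N_b(v)\setminus\partial N_b(v),\\ 2(1-\varepsilon^{2\Delta}) & \text{otherwise.}\end{cases}$$

Case $k' = 1$. In this case we have

$$\rho^i_{(z,x^v)}(A) = \tilde\rho^i_{(z,x^v)}(A) = \frac{\int 1_A(x^v)\,g^v(x^v,Y^v_n)\,p^v(z,x^v)\,\psi^v(dx^v)}{\int g^v(x^v,Y^v_n)\,p^v(z,x^v)\,\psi^v(dx^v)}.$$

Thus $b_i = 0$, and estimating as above we obtain $C_{ij} \le 1-\varepsilon^2$ whenever $k'' = 0$ and $v''\in N(v)$, and $C_{ij} = 0$ otherwise. In particular, we obtain

$$\sum_{(k'',v'')\in I}e^{\beta|1-k''|}e^{\beta d(v,v'')}C_{(1,v)(k'',v'')} \le (1-\varepsilon^2)\,e^{\beta(r+1)}\Delta.$$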

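Combining the above two cases and the assumption of the Proposition yields

$$\max_{(k',v')\in I}\sum_{(k'',v'')\in I}e^{\beta|k'-k''|}e^{\beta d(v',v'')}C_{(k',v')(k'',v'')} \le c.$$

Applying Theorem 2.11 and Lemma 2.13 gives

$$\|\eta^v_x - \tilde\eta^v_x\| = \sup_{f\in\mathcal{X}^v:\,|f|\le 1}\bigg|\int\rho(dz,d\omega)\,f(\omega) - \int\tilde\rho(dz,d\omega)\,f(\omega)\bigg|$$

$$\le 2(1-\varepsilon^{2\Delta})\sum_{v'\in V\setminus(N_b(v)\setminus\partial N_b(v))}D_{(1,v)(0,v')}$$

$$\le 2\,\frac{1-\varepsilon^{2\Delta}}{1-c}\,e^{-\beta}\,e^{-\beta d(v,\partial N_b(v))}$$

$$\le 2\,\frac{(1-\varepsilon^{2\Delta})\,e^{-\beta(2-r)}}{1-c}\,e^{-\beta b},$$

where in the last inequality we used the fact that $d(v,\partial N_b(v)) \ge b-r+1$. As the choice of $x\in\mathbb{X}$ was arbitrary, we get

$$\sup_{x\in\mathbb{X}}\|\eta^v_x - \tilde\eta^v_x\| \le 2\,\frac{(1-\varepsilon^{2\Delta})\,e^{-\beta(2-r)}}{1-c}\,e^{-\beta b}.$$

□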
B.2 Proof of Theorem 5.4 with one-sided Dobrushin comparison theorem


Using Proposition B.1 and Proposition B.2 we can now apply the one-sided Dobrushin comparison theorem (Theorem 2.12) to analyze the quantity $\|\mathsf{F}_n\nu - \tilde{\mathsf{F}}_n\nu\|_J$ and to provide a bound that is spatially homogeneous in $J\subseteq V$. As explained in Section 5.6, the key intuition behind the following proof is that both the filter recursion and the approximate filter recursion can be phrased in terms of Gibbs samplers, which can then be easily compared.

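Theorem B.3. Suppose there exists $\varepsilon > 0$ such that

$$\varepsilon \le p^v(x,z^v) \le \varepsilon^{-1}\qquad\text{for all } v\in V,\ x,z\in\mathbb{X}.$$

Let $\nu$ be a probability measure on $\mathbb{X}$, and suppose that

$$\mathrm{Corr}(\nu,\beta) + (1-\varepsilon^2)\,e^{\beta(r+1)}\Delta \le c < 1,$$

$$\frac{(1-\varepsilon^2)\,e^{\beta(r+1)}\Delta}{1-c} \le c' < 1,$$

for a sufficiently small constant $\beta > 0$. Then, for each $n \ge 1$ and $J\subseteq V$ we have

$$\|\mathsf{F}_n\nu - \tilde{\mathsf{F}}_n\nu\|_J \le 2\,\mathrm{card}\,J\,\bigg(\frac{e^{-\beta(2-r)}(1-\varepsilon^{2\Delta})}{(1-c)(1-c')}\,e^{-\beta b} + \frac{c'^{\,m}}{1-c'}\bigg).$$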


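Proof. Fix $n \ge 1$ and $J\subseteq V$. To lighten the notation, we write $\eta^v$ for $\eta^v_{n,\nu}$ and $G^v$ for $G^v_{n,\nu}$, and analogously for $\tilde\eta^v$ and $\tilde G^v$. By construction, for each $v\in V$ the kernel $G^v$ leaves the measure $\mathsf{F}_n\nu$ invariant, that is,

$$(\mathsf{F}_n\nu)\,G^v_{n,\nu} = \mathsf{F}_n\nu.$$

Hence, we can express the filter recursion as $m$ sweeps of a Gibbs sampler, namely,

$$\mathsf{F}_n\nu = (\mathsf{F}_n\nu)(G^{v_1}\cdots G^{v_d})^m.$$

On the other hand, the approximate Gibbs sampler filter recursion reads

$$\tilde{\mathsf{F}}_n\nu := \nu\,(\tilde G^{v_1}_{n,\nu}\cdots\tilde G^{v_d}_{n,\nu})^m.$$

Therefore, we can decompose the one-step error between filter and approximate filter as

$$\|\mathsf{F}_n\nu - \tilde{\mathsf{F}}_n\nu\|_J = \sup_{f\in\mathcal{X}^J:\,|f|\le 1}\big|(\mathsf{F}_n\nu)(G^{v_1}\cdots G^{v_d})^m f - \nu(\tilde G^{v_1}\cdots\tilde G^{v_d})^m f\big|$$

$$\le \sup_{f\in\mathcal{X}^J:\,|f|\le 1}\ \sup_{z,\tilde z\in\mathbb{X}}\big|\delta_z(G^{v_1}\cdots G^{v_d})^m f - \delta_{\tilde z}(\tilde G^{v_1}\cdots\tilde G^{v_d})^m f\big|$$

$$\le \sup_{z\in\mathbb{X}}\|\delta_z(G^{v_1}\cdots G^{v_d})^m - \delta_z(\tilde G^{v_1}\cdots\tilde G^{v_d})^m\|_J + \sup_{z,\tilde z\in\mathbb{X}}\|\delta_z(G^{v_1}\cdots G^{v_d})^m - \delta_{\tilde z}(G^{v_1}\cdots G^{v_d})^m\|_J. \tag{B.1}$$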
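We first analyze the first term on the right side of (B.1), which will give us a bound depending on the approximation parameter $b$. Henceforth, fix $z\in\mathbb{X}$. For any bounded measurable function $f$ on $\mathbb{X}$ we have

$$\delta_z(G^{v_1}\cdots G^{v_d})^m f = \int\delta_z(dx_0)\prod_{\ell=1}^m\prod_{k=1}^d\eta^{v_k}\big(x_\ell^{\{v_1,\ldots,v_{k-1}\}}x_{\ell-1}^{\{v_k,\ldots,v_d\}},\,dx_\ell^{v_k}\big)\,f(x_m),$$

$$\delta_z(\tilde G^{v_1}\cdots\tilde G^{v_d})^m f = \int\delta_z(dx_0)\prod_{\ell=1}^m\prod_{k=1}^d\tilde\eta^{v_k}\big(x_\ell^{\{v_1,\ldots,v_{k-1}\}}x_{\ell-1}^{\{v_k,\ldots,v_d\}},\,dx_\ell^{v_k}\big)\,f(x_m).$$

Define $I := \bigcup_{\ell=0}^m(\{\ell\}\times V)$ and $\mathbb{S} := \mathbb{X}^{m+1}$. Define the probability measures on $\mathbb{S}$

$$\rho(A) := \int\delta_z(dx_0)\prod_{\ell=1}^m\prod_{k=1}^d\eta^{v_k}\big(x_\ell^{\{v_1,\ldots,v_{k-1}\}}x_{\ell-1}^{\{v_k,\ldots,v_d\}},\,dx_\ell^{v_k}\big)\,1_A(x),$$

$$\tilde\rho(A) := \int\delta_z(dx_0)\prod_{\ell=1}^m\prod_{k=1}^d\tilde\eta^{v_k}\big(x_\ell^{\{v_1,\ldots,v_{k-1}\}}x_{\ell-1}^{\{v_k,\ldots,v_d\}},\,dx_\ell^{v_k}\big)\,1_A(x).$$

By construction, we have

$$\big|\delta_z(G^{v_1}\cdots G^{v_d})^m f - \delta_z(\tilde G^{v_1}\cdots\tilde G^{v_d})^m f\big| = \bigg|\int\rho(dx)\,f(x_m) - \int\tilde\rho(dx)\,f(x_m)\bigg|.$$

We want to use Theorem 2.12 to bound this quantity. To this end, let $\tau$ be defined as

$$\tau : i = (\ell,v_k)\in I \mapsto \tau(i) = \ell d + k,$$

and for each $i\in I$, $x\in\mathbb{S}$, let

$$\gamma^i_x(A) := \rho\big(X^i\in A \mid X^{I_{\le\tau(i)}\setminus\{i\}} = x^{I_{\le\tau(i)}\setminus\{i\}}\big),$$

$$\tilde\gamma^i_x(A) := \tilde\rho\big(X^i\in A \mid X^{I_{\le\tau(i)}\setminus\{i\}} = x^{I_{\le\tau(i)}\setminus\{i\}}\big).$$

We immediately find that for each $x\in\mathbb{S}$, $\ell\in\{1,\ldots,m\}$, $k\in\{1,\ldots,d\}$, we have

$$\gamma^{(0,v)}_x(A) = \tilde\gamma^{(0,v)}_x(A) = \delta_{z^v}(A),$$

$$\gamma^{(\ell,v_k)}_x(A) = \eta^{v_k}\big(x_\ell^{\{v_1,\ldots,v_{k-1}\}}x_{\ell-1}^{\{v_k,\ldots,v_d\}},\,A\big),$$

$$\tilde\gamma^{(\ell,v_k)}_x(A) = \tilde\eta^{v_k}\big(x_\ell^{\{v_1,\ldots,v_{k-1}\}}x_{\ell-1}^{\{v_k,\ldots,v_d\}},\,A\big).$$

Recall the following definition from Proposition B.1:

$$R_{vv'} := \frac{1}{2}\sup_{x,\tilde x\in\mathbb{X}:\ x^{V\setminus\{v'\}} = \tilde x^{V\setminus\{v'\}}}\|\eta^v_x - \eta^v_{\tilde x}\|\qquad\text{for } v,v'\in V.$$

It is easy to check that for each $i,j\in I$ we have

$$C_{ij} = \begin{cases} R_{vv'} & \text{if } i = (\ell,v),\ j = (\ell',v') \text{ for } 0 < \tau(i)-\tau(j) \le d-1,\\ 0 & \text{otherwise,}\end{cases}$$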
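and for each $j\in I$ we have

$$b_j = \begin{cases}\sup_{x\in\mathbb{X}}\|\eta^v_x - \tilde\eta^v_x\| & \text{if } j = (\ell,v),\ \ell \ge 1,\\ 0 & \text{otherwise.}\end{cases}$$

So, by Theorem 2.12 and Proposition B.2 (to bound the $b_j$'s) we have

$$\|\delta_z(G^{v_1}\cdots G^{v_d})^m - \delta_z(\tilde G^{v_1}\cdots\tilde G^{v_d})^m\|_J \le \sum_{v\in J}\sum_{j\in I}D_{(m,v)j}\,b_j \le 2\,\frac{(1-\varepsilon^{2\Delta})\,e^{-\beta(2-r)}}{1-c}\,e^{-\beta b}\sum_{v\in J}\sum_{j\in I}D_{(m,v)j}.$$

Moreover, by Proposition B.1 we have

$$\max_{i\in I}\sum_{j\in I}C_{ij} = \max_{v\in V}\sum_{v'\in V}R_{vv'} \le \frac{(1-\varepsilon^2)\,\Delta\,e^{\beta(r-1)}}{1-c} \le c' < 1,$$

from which by Lemma 2.13 it follows that

$$\max_{i\in I}\sum_{j\in I}D_{ij} \le \frac{1}{1-c'}.$$

We finally obtain

$$\|\delta_z(G^{v_1}\cdots G^{v_d})^m - \delta_z(\tilde G^{v_1}\cdots\tilde G^{v_d})^m\|_J \le 2\,\mathrm{card}\,J\ \frac{(1-\varepsilon^{2\Delta})\,e^{-\beta(2-r)}}{(1-c)(1-c')}\,e^{-\beta b}. \tag{B.2}$$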


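We now analyze the second term on the right side of (B.1), which will give us a bound depending on the iteration step $m$. Henceforth, fix $z,\tilde z\in\mathbb{X}$. Define the probability measures on $\mathbb{S}$

$$\rho(A) := \int\delta_z(dx_0)\prod_{\ell=1}^m\prod_{k=1}^d\eta^{v_k}\big(x_\ell^{\{v_1,\ldots,v_{k-1}\}}x_{\ell-1}^{\{v_k,\ldots,v_d\}},\,dx_\ell^{v_k}\big)\,1_A(x),$$

$$\tilde\rho(A) := \int\delta_{\tilde z}(dx_0)\prod_{\ell=1}^m\prod_{k=1}^d\eta^{v_k}\big(x_\ell^{\{v_1,\ldots,v_{k-1}\}}x_{\ell-1}^{\{v_k,\ldots,v_d\}},\,dx_\ell^{v_k}\big)\,1_A(x).$$

By construction we have, for any bounded measurable function $f$ on $\mathbb{X}$,

$$\big|\delta_z(G^{v_1}\cdots G^{v_d})^m f - \delta_{\tilde z}(G^{v_1}\cdots G^{v_d})^m f\big| = \bigg|\int\rho(dx)\,f(x_m) - \int\tilde\rho(dx)\,f(x_m)\bigg|.$$

In the present case we find the following expressions for the one-sided conditional distributions, for each $x\in\mathbb{S}$, $\ell\in\{1,\ldots,m\}$, $k\in\{1,\ldots,d\}$,

$$\gamma^{(0,v)}_x(A) = \delta_{z^v}(A),\qquad \tilde\gamma^{(0,v)}_x(A) = \delta_{\tilde z^v}(A),$$

$$\gamma^{(\ell,v_k)}_x(A) = \tilde\gamma^{(\ell,v_k)}_x(A) = \eta^{v_k}\big(x_\ell^{\{v_1,\ldots,v_{k-1}\}}x_{\ell-1}^{\{v_k,\ldots,v_d\}},\,A\big).$$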
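As before, for each $i,j\in I$ we have

$$C_{ij} = \begin{cases} R_{vv'} & \text{if } i = (\ell,v),\ j = (\ell',v') \text{ for } 0 < \tau(i)-\tau(j) \le d-1,\\ 0 & \text{otherwise,}\end{cases}$$

and for each $j\in I$ we now have

$$b_j \le \begin{cases} 2 & \text{if } j = (0,v) \text{ for some } v\in V,\\ 0 & \text{otherwise.}\end{cases}$$

By Theorem 2.12 we find

$$\|\delta_z(G^{v_1}\cdots G^{v_d})^m - \delta_{\tilde z}(G^{v_1}\cdots G^{v_d})^m\|_J \le \sum_{v\in J}\sum_{j\in I}D_{(m,v)j}\,b_j \le 2\sum_{v\in J}\sum_{v'\in V}D_{(m,v)(0,v')}.$$

Proceeding as above, by Proposition B.1 we have

$$\max_{i\in I}\sum_{j\in I}C_{ij} \le c' < 1,$$

from which it follows that

$$\sum_{v'\in V}D_{(m,v)(0,v')} = \sum_{v'\in V}\sum_{n=m}^\infty C^n_{(m,v)(0,v')} \le \frac{c'^{\,m}}{1-c'},$$

where we have used that $C_{ij} \ne 0$ only if $0 < \tau(i)-\tau(j) \le d-1$. We finally obtain

$$\|\delta_z(G^{v_1}\cdots G^{v_d})^m - \delta_{\tilde z}(G^{v_1}\cdots G^{v_d})^m\|_J \le 2\,\mathrm{card}\,J\ \frac{c'^{\,m}}{1-c'}. \tag{B.3}$$

As the choice of $z,\tilde z\in\mathbb{X}$ is arbitrary, together (B.2) and (B.3) yield the statement of the Theorem. □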

The  proof  of  Theorem  5.4  follows  as  an  immediate  consequence  of  Theorem  B.3. 

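Proof of Theorem 5.4. In Theorem B.3, choose $c = \frac{1}{2}$ and $c' = \frac{1}{4}$. Let

$$\frac{(1-\varepsilon^2)\,e^{\beta(r+1)}\Delta}{1-c} = c',$$

from which we get $\beta = \frac{1}{r+1}\log\frac{1}{8\Delta(1-\varepsilon^2)} > 0$, as $\varepsilon > \varepsilon_0 := \sqrt{1-\frac{1}{8\Delta}}$. As $\mathrm{Corr}(\nu,\beta) \le \frac{1}{4}$ by assumption, we find

$$\mathrm{Corr}(\nu,\beta) + (1-\varepsilon^2)\,e^{\beta(r+1)}\Delta \le \frac{1}{4} + c'(1-c) = \frac{3}{8} < \frac{1}{2} = c.$$

Hence, both assumptions in Theorem B.3 hold, and for each $n \ge 1$ and $J\subseteq V$ we get

$$\|\mathsf{F}_n\nu - \tilde{\mathsf{F}}_n\nu\|_J \le 2\,\mathrm{card}\,J\,\Big(\frac{8}{3}\,e^{-\beta(2-r)}(1-\varepsilon^{2\Delta})\,e^{-\beta b} + \frac{4}{3}\,e^{-(\log 4)m}\Big)$$

$$\le \frac{8}{3}\,\mathrm{card}\,J\,\big(2\,e^{-\beta(2-r)}(1-\varepsilon^{2\Delta})\,e^{-\beta b} + e^{-(\log 4)m}\big)$$

$$\le \alpha\,\mathrm{card}\,J\ e^{-\gamma\min\{b,m\}},$$

where

$$\alpha := 4\Big(\frac{4}{3}\,\big(8\Delta(1-\varepsilon^2)\big)^{\frac{2-r}{r+1}}(1-\varepsilon^{2\Delta}) + \frac{2}{3}\Big),\qquad \gamma := \min\Big\{\frac{1}{r+1}\log\frac{1}{8\Delta(1-\varepsilon^2)},\ \log 4\Big\}.$$

□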
Appendix C

Comparison theorems for Gibbs measures: proofs


The first part of this appendix (Sections C.1-C.5) is devoted to providing the proofs for the generalized comparison theorems introduced in Chapter 6 (Theorem 6.4, Corollary 6.8, and Theorem 6.12). The second part of this appendix (Section C.6) is devoted to developing the application of the generalized comparison theorems to block particle filters (Theorem 6.13).


C.1 General comparison principle

The proof of Theorem 6.4 is derived from a general comparison principle for Markov chains that will be formalized in this section. The basic idea behind this principle is to consider two transition kernels $G$ and $\tilde G$ on $\mathbb{S}$ such that $\rho G = \rho$ and $\tilde\rho\tilde G = \tilde\rho$. One should think of $G$ as the transition kernel of a Markov chain that admits $\rho$ as its invariant measure, and similarly for $\tilde G$. The comparison principle of this section provides a general method to bound the difference between the invariant measures $\rho$ and $\tilde\rho$ in terms of the transition kernels $G$ and $\tilde G$. In the following sections, we will apply this principle to a specific choice of $G$ and $\tilde G$ that is derived from the coupled update rule.

We begin by introducing a standard notion in the analysis of high-dimensional Markov chains, cf. [23] (note that our indices are reversed as compared to the definition in [23]).

Definition C.1. $(V_{ij})_{i,j\in I}$ is called a Wasserstein matrix for a transition kernel $G$ on $\mathbb{S}$ if

$$\mathrm{osc}_j\,Gf \le \sum_{i\in I}\mathrm{osc}_i f\ V_{ij}$$

for every $j\in I$ and bounded and measurable quasilocal function $f$.

We  now  state  our  general  comparison  principle. 
Proposition C.2. Let $G$ and $\tilde G$ be transition kernels on $\mathbb{S}$ such that $\rho G = \rho$ and $\tilde\rho\tilde G = \tilde\rho$, and let $Q_x$ be a coupling between the measures $G_x$ and $\tilde G_x$ for every $x\in\mathbb{S}$. Assume that $G$ is quasilocal, and let $V$ be a Wasserstein matrix for $G$. Then we have

$$|\rho f - \tilde\rho f| \le \sum_{i,j\in I}\mathrm{osc}_i f\ N^{(n)}_{ij}\int\tilde\rho(dx)\,Q_x\eta_j + \sum_{i,j\in I}\mathrm{osc}_i f\ V^n_{ij}\,(\rho\otimes\tilde\rho)\eta_j,$$

where we defined

$$N^{(n)} := \sum_{k=0}^{n-1}V^k,$$

for any bounded and measurable quasilocal function $f$ and $n \ge 1$.

Theorem 6.4 will be derived from this result. Roughly speaking, we will design the transition kernel $G$ such that $V = I - W + R$ is a Wasserstein matrix; then assumption (6.1) implies that the second term in Proposition C.2 vanishes as $n\to\infty$, and the result of Theorem 6.4 reduces to some matrix algebra (as will be explained below, however, a more complicated argument is needed to obtain Theorem 6.4 in full generality).

To  prove  Proposition  C.2  we  require  a  simple  lemma. 


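Lemma C.3. Let $Q$ be a coupling of probability measures $\mu,\nu$ on $\mathbb{S}$. Then

$$|\mu f - \nu f| \le \sum_{i\in I}\mathrm{osc}_i f\ Q\eta_i$$

for every bounded and measurable quasilocal function $f$.

Proof. Let $J\in\mathcal{I}$. Enumerate its elements arbitrarily as $J = \{j_1,\ldots,j_r\}$, and define $J_k = \{j_1,\ldots,j_k\}$ for $1 \le k \le r$ and $J_0 = \varnothing$. Then we can evidently estimate

$$|f^J_x(z) - f^J_x(\tilde z)| \le \sum_{k=1}^r\big|f^J_x(z^{J_k}\tilde z^{J\setminus J_k}) - f^J_x(z^{J_{k-1}}\tilde z^{J\setminus J_{k-1}})\big| \le \sum_{j\in J}\mathrm{osc}_j f\ \eta_j(z_j,\tilde z_j).$$

As $f$ is quasilocal, we can let $J\uparrow I$ to obtain

$$|f(z) - f(\tilde z)| \le \sum_{i\in I}\mathrm{osc}_i f\ \eta_i(z_i,\tilde z_i).$$

The result follows readily as $|\mu f - \nu f| \le \int|f(z) - f(\tilde z)|\,Q(dz,d\tilde z)$. □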

We  now  proceed  to  the  proof  of  Proposition  C.2. 


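Proof of Proposition C.2. We begin by writing

$$|\rho f - \tilde\rho f| = |\rho G^n f - \tilde\rho\tilde G^n f|$$

$$\le \sum_{k=0}^{n-1}\big|\tilde\rho\tilde G^{n-k-1}G^{k+1}f - \tilde\rho\tilde G^{n-k}G^k f\big| + |\rho G^n f - \tilde\rho G^n f|$$

$$= \sum_{k=0}^{n-1}\big|\tilde\rho GG^k f - \tilde\rho\tilde GG^k f\big| + |\rho G^n f - \tilde\rho G^n f|.$$

As $G$ is assumed quasilocal, $G^k f$ is quasilocal, and thus Lemma C.3 yields

$$\big|\tilde\rho GG^k f - \tilde\rho\tilde GG^k f\big| \le \int\tilde\rho(dx)\,\big|G_xG^k f - \tilde G_xG^k f\big|$$

$$\le \int\tilde\rho(dx)\sum_{j\in I}\mathrm{osc}_j\,G^k f\ Q_x\eta_j$$

$$\le \sum_{i,j\in I}\mathrm{osc}_i f\ V^k_{ij}\int\tilde\rho(dx)\,Q_x\eta_j.$$

Similarly, as $\rho\otimes\tilde\rho$ is a coupling of $\rho,\tilde\rho$, we obtain by Lemma C.3

$$|\rho G^n f - \tilde\rho G^n f| \le \sum_{j\in I}\mathrm{osc}_j\,G^n f\ (\rho\otimes\tilde\rho)\eta_j \le \sum_{i,j\in I}\mathrm{osc}_i f\ V^n_{ij}\,(\rho\otimes\tilde\rho)\eta_j.$$

Thus the proof is complete. □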


C.2  Gibbs  samplers 

To put Proposition C.2 to good use, we must construct transition kernels $G$ and $\tilde G$ for which $\rho$ and $\tilde\rho$ are invariant, and that admit tractable estimates for the quantities in the comparison theorem in terms of the coupled update rule $(\gamma^J,\tilde\gamma^J,Q^J,\hat Q^J)_{J\in\mathcal{J}}$ and the weights $(w_J)_{J\in\mathcal{J}}$. To this end, we will use a standard construction called the Gibbs sampler: in each time step, we draw a region $J\in\mathcal{J}$ with probability $v_J\propto w_J$, and then apply the transition kernel $\gamma^J$ to the current configuration. This readily defines a transition kernel $G$ for which $\rho$ is $G$-invariant (as $\rho$ is $\gamma^J$-invariant for every $J\in\mathcal{J}$). The construction for $\tilde G$ is identical. As will be explained below, this is not the most natural construction for the proof of our main result; however, it will form the basis for further computations.
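To make the Gibbs sampler construction concrete, the following is a minimal sketch on a hypothetical two-site binary model. It is an illustration only: the measure `rho`, the selection probabilities `v`, and all function names are assumptions that do not appear in the thesis. The regions are the single sites, a site is selected with probability `v[site]`, the chain stays put with the remaining probability, and the selected coordinate is resampled from its conditional law under `rho`. The sketch then checks numerically that `rho` is invariant.

```python
from itertools import product

# State space {0,1}^2: two sites, each carrying a binary value.
STATES = list(product([0, 1], repeat=2))


def conditional(rho, x, site):
    """Conditional law under rho of coordinate `site`, given the other coordinate of x."""
    weights = {}
    for value in (0, 1):
        y = list(x)
        y[site] = value
        weights[tuple(y)] = rho[tuple(y)]
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


def gibbs_kernel(rho, v):
    """Transition probabilities G[x][y] of a lazy random-scan Gibbs sampler.

    v[site] is the probability of selecting `site`; sum(v) <= 1 is assumed,
    and the chain does not move with probability 1 - sum(v).
    """
    G = {x: {y: 0.0 for y in STATES} for x in STATES}
    for x in STATES:
        G[x][x] += 1.0 - sum(v)
        for site, prob in enumerate(v):
            for y, p in conditional(rho, x, site).items():
                G[x][y] += prob * p
    return G


def push_forward(rho, G):
    """Compute the measure rho G."""
    return {y: sum(rho[x] * G[x][y] for x in STATES) for y in STATES}


if __name__ == "__main__":
    rho = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
    G = gibbs_kernel(rho, (0.3, 0.5))
    print(push_forward(rho, G))
```

Invariance holds here because each conditional resampling step preserves `rho`, and a mixture of `rho`-preserving kernels is again `rho`-preserving.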

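We fix throughout this section a coupled update rule $(\gamma^J,\tilde\gamma^J,Q^J,\hat Q^J)_{J\in\mathcal{J}}$ for $(\rho,\tilde\rho)$ and weights $(w_J)_{J\in\mathcal{J}}$ satisfying the assumptions of Theorem 6.4. Let $v = (v_J)_{J\in\mathcal{J}}$ be a sequence of nonnegative weights such that $\sum_J v_J \le 1$. We define the Gibbs samplers

$$G^v_x(A) := \Big(1-\sum_{J\in\mathcal{J}}v_J\Big)1_A(x) + \sum_{J\in\mathcal{J}}v_J\int 1_A(z^Jx^{I\setminus J})\,\gamma^J_x(dz^J),$$

$$\tilde G^v_x(A) := \Big(1-\sum_{J\in\mathcal{J}}v_J\Big)1_A(x) + \sum_{J\in\mathcal{J}}v_J\int 1_A(z^Jx^{I\setminus J})\,\tilde\gamma^J_x(dz^J).$$

Evidently $G^v$ and $\tilde G^v$ are transition kernels on $\mathbb{S}$, and $\rho G^v = \rho$ and $\tilde\rho\tilde G^v = \tilde\rho$ by construction. To apply Proposition C.2, we must establish some basic properties.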

Lemma C.4. Assume that $\gamma^J$ is quasilocal for every $J\in\mathcal{J}$. Then $G^v$ is quasilocal.

Proof. Let $f : \mathbb{S}\to\mathbb{R}$ be a bounded and measurable quasilocal function. It evidently suffices to show that $\gamma^Jf^J$ is quasilocal for every $J\in\mathcal{J}$. To this end, let us fix $J\in\mathcal{J}$, $x,z\in\mathbb{S}$, and $J_1,J_2,\ldots\in\mathcal{I}$ such that $J_1\subseteq J_2\subseteq\cdots$ and $\bigcup_i J_i = I$. Then we have

$$\gamma^J_{z^{J_i}x^{I\setminus J_i}} \xrightarrow{\ i\to\infty\ } \gamma^J_z\quad\text{setwise}$$

as $\gamma^J$ is quasilocal. On the other hand, we have

$$f^J_{z^{J_i}x^{I\setminus J_i}} \xrightarrow{\ i\to\infty\ } f^J_z\quad\text{pointwise}$$

as $f$ is quasilocal. Thus by [13, Proposition 18, p. 270] we obtain

$$\gamma^J_{z^{J_i}x^{I\setminus J_i}}f^J_{z^{J_i}x^{I\setminus J_i}} \xrightarrow{\ i\to\infty\ } \gamma^J_zf^J_z.$$

As the choice of $x,z$ and $(J_i)_{i\ge 1}$ is arbitrary, the result follows. □

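Lemma C.5. Assume that $\gamma^J$ is quasilocal for every $J\in\mathcal{J}$, and define

$$W^v_{ij} := 1_{i=j}\sum_{J\in\mathcal{J}:\,i\in J}v_J,\qquad R^v_{ij} := \sup_{x,z\in\mathbb{S}:\ x^{I\setminus\{j\}} = z^{I\setminus\{j\}}}\frac{1}{\eta_j(x_j,z_j)}\sum_{J\in\mathcal{J}:\,i\in J}v_J\,Q^J_{x,z}\eta_i.$$

Then $V^v = I - W^v + R^v$ is a Wasserstein matrix for $G^v$.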

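Proof. Let $f : \mathbb{S}\to\mathbb{R}$ be a bounded and measurable quasilocal function, and let $x,z\in\mathbb{S}$ be configurations that differ at a single site $\mathrm{card}\{i\in I : x_i\ne z_i\} = 1$. Note that

$$\gamma^J_xf^J_x = (\gamma^J_x\otimes\delta_{x^{I\setminus J}})f,\qquad \gamma^J_zf^J_z = (\gamma^J_z\otimes\delta_{z^{I\setminus J}})f.$$

As $Q^J_{x,z}$ is a coupling of $\gamma^J_x$ and $\gamma^J_z$ by construction, the measure $Q^J_{x,z}\otimes\delta_{x^{I\setminus J}}\otimes\delta_{z^{I\setminus J}}$ is a coupling of $\gamma^J_x\otimes\delta_{x^{I\setminus J}}$ and $\gamma^J_z\otimes\delta_{z^{I\setminus J}}$. Thus Lemma C.3 yields

$$|\gamma^J_xf^J_x - \gamma^J_zf^J_z| \le \sum_{i\in I}\mathrm{osc}_i f\ (Q^J_{x,z}\otimes\delta_{x^{I\setminus J}}\otimes\delta_{z^{I\setminus J}})\eta_i = \sum_{i\in J}\mathrm{osc}_i f\ Q^J_{x,z}\eta_i + \sum_{i\in I\setminus J}\mathrm{osc}_i f\ \eta_i(x_i,z_i).$$

In particular, we obtain

$$|G^vf(x) - G^vf(z)| \le \Big(1-\sum_{J\in\mathcal{J}}v_J\Big)|f(x)-f(z)| + \sum_{J\in\mathcal{J}}v_J\,|\gamma^J_xf^J_x - \gamma^J_zf^J_z|$$

$$\le \Big(1-\sum_{J\in\mathcal{J}}v_J\Big)\sum_{i\in I}\mathrm{osc}_i f\ \eta_i(x_i,z_i) + \sum_{J\in\mathcal{J}}v_J\Big\{\sum_{i\in J}\mathrm{osc}_i f\ Q^J_{x,z}\eta_i + \sum_{i\in I\setminus J}\mathrm{osc}_i f\ \eta_i(x_i,z_i)\Big\}$$

$$= \sum_{i\in I}\mathrm{osc}_i f\ \{1-W^v_{ii}\}\,\eta_i(x_i,z_i) + \sum_{i\in I}\mathrm{osc}_i f\sum_{J\in\mathcal{J}:\,i\in J}v_J\,Q^J_{x,z}\eta_i.$$

Now suppose that $x^{I\setminus\{j\}} = z^{I\setminus\{j\}}$ (and $x\ne z$). Then by definition

$$\sum_{J\in\mathcal{J}:\,i\in J}v_J\,Q^J_{x,z}\eta_i \le R^v_{ij}\,\eta_j(x_j,z_j),$$

and we obtain

$$\frac{|G^vf(x) - G^vf(z)|}{\eta_j(x_j,z_j)} \le \mathrm{osc}_j f\ \{1-W^v_{jj}\} + \sum_{i\in I}\mathrm{osc}_i f\ R^v_{ij}.$$

Thus $V^v = I - W^v + R^v$ satisfies Definition C.1. □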

Using  Lemmas  C.4  and  C.5,  we  can  now  apply  Proposition  C.2. 

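Corollary C.6. Assume that $\gamma^J$ is quasilocal for every $J\in\mathcal{J}$. Then

$$|\rho f - \tilde\rho f| \le \sum_{i,j\in I}\mathrm{osc}_i f\ N^{v(n)}_{ij}\,a^v_j + \sum_{i,j\in I}\mathrm{osc}_i f\ (I - W^v + R^v)^n_{ij}\,(\rho\otimes\tilde\rho)\eta_j$$

for every $n \ge 1$ and bounded and measurable quasilocal function $f$, where

$$N^{v(n)} := \sum_{k=0}^{n-1}(I - W^v + R^v)^k$$

and the coefficients $(a^v_j)_{j\in I}$ are defined by $a^v_j := \sum_{J\in\mathcal{J}:\,j\in J}v_J\int\tilde\rho(dx)\,\hat Q^J_x\eta_j$.

Proof. Let $G = G^v$, $\tilde G = \tilde G^v$, $V = I - W^v + R^v$ in Proposition C.2. The requisite assumptions are verified by Lemmas C.4 and C.5, and it remains to show that there exists a coupling $Q_x$ of $G_x$ and $\tilde G_x$ such that $\int\tilde\rho(dx)\,Q_x\eta_j \le a^v_j$ for every $j\in I$. But choosing

$$Q_xg := \Big(1-\sum_{J\in\mathcal{J}}v_J\Big)g(x,x) + \sum_{J\in\mathcal{J}}v_J\int\hat Q^J_x(dz^J,d\tilde z^J)\,g(z^Jx^{I\setminus J},\tilde z^Jx^{I\setminus J}),$$

it is easily verified that $Q_x$ satisfies the necessary properties. □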

In order for the construction of the Gibbs sampler to make sense, the weights $v_J$ must be probabilities. This imposes the requirement $\sum_J v_J \le 1$. If we were to assume that $\sum_J w_J \le 1$, we could apply Corollary C.6 with $v_J = w_J$. Then assumption (6.1) guarantees that the second term in Corollary C.6 vanishes as $n\to\infty$, which yields

$$|\rho f - \tilde\rho f| \le \sum_{i,j\in I}\mathrm{osc}_i f\ N_{ij}\,a_j\qquad\text{with}\qquad N := \sum_{k=0}^\infty(I - W + R)^k.$$

The proof of Theorem 6.4 would now be complete after we establish the identity

$$N = \sum_{k=0}^\infty(I - W + R)^k = \sum_{k=0}^\infty(W^{-1}R)^k\,W^{-1} = DW^{-1}.$$

This straightforward matrix identity will be proved in the next section. The assumption that the weights $w_J$ are summable is restrictive, however, when $I$ is infinite: in Theorem 6.4 we only assume that $W_{ii} \le 1$ for all $i$, so we evidently cannot set $v_J = w_J$.
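The matrix identity stated above can be sanity-checked numerically in finite dimension. The following is a minimal sketch under illustrative assumptions: a hypothetical $2\times 2$ example with diagonal $W$ (entries in `W_DIAG`) and nonnegative $R$, chosen so that both series converge. The numbers and helper names are not taken from the thesis. In this finite setting both series sum to $(W-R)^{-1}$, which the check also uses.

```python
def matmul(A, B):
    """Product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def neumann_sum(M, terms):
    """Truncated series sum_{k=0}^{terms-1} M^k."""
    n = len(M)
    total = [[0.0] * n for _ in range(n)]
    power = identity(n)
    for _ in range(terms):
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
        power = matmul(power, M)
    return total


# Hypothetical data: W is diagonal with entries in (0, 1], R is nonnegative.
W_DIAG = [0.5, 0.8]
R = [[0.1, 0.2], [0.1, 0.3]]


def both_sides(terms=400):
    """Return (sum_k (I - W + R)^k, sum_k (W^{-1} R)^k W^{-1}), truncated."""
    n = len(W_DIAG)
    V = [[(1.0 - W_DIAG[i] if i == j else 0.0) + R[i][j] for j in range(n)]
         for i in range(n)]
    W_inv_R = [[R[i][j] / W_DIAG[i] for j in range(n)] for i in range(n)]
    W_inv = [[1.0 / W_DIAG[i] if i == j else 0.0 for j in range(n)]
             for i in range(n)]
    lhs = neumann_sum(V, terms)
    rhs = matmul(neumann_sum(W_inv_R, terms), W_inv)
    return lhs, rhs


if __name__ == "__main__":
    lhs, rhs = both_sides()
    print(lhs)
    print(rhs)
```

The finite example says nothing about the infinite-dimensional case, where convergence of the series has to be argued separately.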

When the weights $w_J$ are not summable, it is not natural to interpret them as probabilities. In this setting, a much more natural construction would be to consider a continuous time counterpart of the Gibbs sampler called Glauber dynamics. To define this process, one attaches to each region $J\in\mathcal{J}$ an independent Poisson process with rate $w_J$, and applies the transition kernel $\gamma^J$ at every jump time of the corresponding Poisson process. Thus $w_J$ does not represent the probability of selecting the region $J$ in one time step, but rather the frequency with which region $J$ is selected in continuous time. Once this process has been defined, one would choose the transition kernel $G$ to be the transition semigroup of the continuous time process on any fixed time interval. Proceeding with this construction we expect, at least formally, to obtain Theorem 6.4 under the stated assumptions.

Unfortunately,  there  are  nontrivial  technical  issues  involved  in  implementing  this 
approach:  it  is  not  evident  a  priori  that  the  continuous  time  construction  defines  a 
well-behaved  Markov  semigroup,  so  that  it  is  unclear  when  the  above  program  can 
be  made  rigorous.  The  existence  of  a  semigroup  has  typically  been  established  under 
more  restrictive  assumptions  than  we  have  imposed  in  the  present  setting  [36].  In 
order  to  circumvent  such  issues,  we  will  proceed  by  an  alternate  route.  Formally,  the 
Glauber  dynamics  can  be  obtained  by  an  appropriate  scaling  limit  of  discrete  time 
Gibbs  samplers.  We  will  also  utilize  this  scaling,  but  instead  of  applying  Proposition 
C.2  to  the  limiting  dynamics  we  will  take  the  scaling  limit  directly  in  Corollary  C.6. 
Thus,  while  our  intuition  comes  from  the  continuous  time  setting,  we  avoid  some 
technicalities  inherent  in  the  construction  of  the  limit  dynamics.  Instead,  we  now 
face  the  problem  of  taking  limits  of  powers  of  infinite  matrices.  The  requisite  matrix 
algebra  will  be  worked  out  in  the  following  section. 
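The kind of limit involved can be illustrated in finite dimension. The sketch below rests on assumptions that are not taken from the thesis: it uses a hypothetical $2\times 2$ matrix $V = I - W + R$ and assumes that the discrete-time weights are scaled as $v_J = (t/n)\,w_J$, so that one step corresponds to the matrix $I + (t/n)(V - I)$. It compares the $n$-th power of this matrix with the Poisson-weighted series $\sum_k e^{-t}t^k V^k/k!$, which equals $e^{t(V-I)}$. All names are illustrative.

```python
from math import exp


def matmul(A, B):
    """Product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def poisson_weighted_series(V, t, terms=80):
    """Truncated series sum_{k=0}^{terms-1} e^{-t} t^k / k! * V^k."""
    n = len(V)
    total = [[0.0] * n for _ in range(n)]
    power = identity(n)
    weight = exp(-t)  # Poisson(t) probability of k = 0
    for k in range(terms):
        total = [[total[i][j] + weight * power[i][j] for j in range(n)]
                 for i in range(n)]
        power = matmul(power, V)
        weight *= t / (k + 1)  # move to the Poisson probability of k + 1
    return total


def scaled_power(V, t, n_steps):
    """Compute (I + (t / n_steps) (V - I))^{n_steps} by repeated multiplication."""
    n = len(V)
    eye = identity(n)
    step = [[eye[i][j] + (t / n_steps) * (V[i][j] - eye[i][j]) for j in range(n)]
            for i in range(n)]
    result = identity(n)
    for _ in range(n_steps):
        result = matmul(result, step)
    return result


if __name__ == "__main__":
    V = [[0.6, 0.2], [0.1, 0.5]]  # hypothetical I - W + R
    print(poisson_weighted_series(V, 2.0))
    print(scaled_power(V, 2.0, 20000))
```

For finite matrices this is the classical limit $(I + A/n)^n \to e^{A}$; the difficulty addressed in the text is that the matrices of interest are infinite.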


Remark  C.7.  Let  us  briefly  sketch  how  the  previous  results  can  be  sharpened  to  obtain 
a  nonlinear  comparison  theorem  that  could  lead  to  sharper  bounds  in  some  situations. 
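Assume for simplicity that $\sum_J w_J \le 1$. Then $V = I - W + R$ is a Wasserstein matrix
for $G$ by Lemma C.5. Writing out the definitions, this means $\delta(Gf) \le \delta(f)V$, where

\[
(\beta V)_j = \sum_{i\in I} \beta_i \sup_{x,z\in\mathbb{S}:\, x^{I\setminus\{j\}} = z^{I\setminus\{j\}}}
\Big\{ 1_{i=j}\Big(1 - \sum_{J:\, i\in J} w_J\Big) + \frac{1}{\eta_j(x_j,z_j)} \sum_{J:\, i\in J} w_J\, Q^J_{x,z}\eta_i \Big\}
\]

(here we interpret $\beta = (\beta_i)_{i\in I}$ and $\delta(f) = (\mathrm{osc}_i f)_{i\in I}$ as row vectors). However, from
the proof of Lemma C.5 we even obtain the sharper bound $\delta(Gf) \le \mathcal{V}[\delta(f)]$, where

\[
\mathcal{V}[\beta]_j := \sup_{x,z\in\mathbb{S}:\, x^{I\setminus\{j\}} = z^{I\setminus\{j\}}} \sum_{i\in I} \beta_i
\Big\{ 1_{i=j}\Big(1 - \sum_{J:\, i\in J} w_J\Big) + \frac{1}{\eta_j(x_j,z_j)} \sum_{J:\, i\in J} w_J\, Q^J_{x,z}\eta_i \Big\}
\]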

is defined with the supremum over configurations outside the sum. The nonlinear
operator $\mathcal{V}$ can now be used in much the same way as the Wasserstein matrix $V$. In
particular, following the identical proof as for Proposition C.2, we immediately obtain


where $\mathcal{V}^k$ denotes the $k$th iterate of the nonlinear operator $\mathcal{V}$. Proceeding along these
lines, one can develop nonlinear comparison theorems under Dobrushin-Shlosman type
conditions (see the discussion in Section 6.3.2). The nonlinear expressions are somewhat
difficult to handle, however, and we do not develop this idea further in this
thesis.


C.3  Proof  of  Theorem  6.4 


Throughout  this  section,  we  work  under  the  assumptions  of  Theorem  6.4.  The  main 
idea  of  the  proof  is  the  following  continuous  scaling  limit  of  Corollary  C.6. 

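Proposition C.8. Let $t > 0$. Define the matrices

\[
N := \sum_{k=0}^{\infty} (I - W + R)^k, \qquad
V^{[t]} := \sum_{k=0}^{\infty} \frac{t^k e^{-t}}{k!}\,(I - W + R)^k.
\]

Then we have, under the assumptions of Theorem 6.4,

\[
|\rho f - \tilde\rho f| \le \sum_{i,j\in I} \mathrm{osc}_i f\, N_{ij}\, a_j
+ \sum_{i,j\in I} \mathrm{osc}_i f\, V^{[t]}_{ij}\, (\rho\otimes\tilde\rho)\eta_j
\]

for every bounded and measurable quasilocal function $f$ such that $\mathrm{osc}_i f < \infty$ for all
$i \in I$.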


Proof. Without loss of generality, we will assume throughout the proof that $f$ is
a local function (so that only finitely many $\mathrm{osc}_i f$ are nonzero). The extension to
quasilocal $f$ follows readily by applying the local result to $f^J_x$ and letting $J \uparrow I$ as in
the proof of Lemma C.3.

As the cover $\mathcal{J}$ is at most countable (because $I$ is countable), we can enumerate
its elements arbitrarily as $\mathcal{J} = \{J_1, J_2, \ldots\}$. Define the weights $v^r = (v^r_J)_{J\in\mathcal{J}}$ as

\[
v^r_J := \begin{cases} w_J & \text{when } J = J_k \text{ for } k \le r, \\ 0 & \text{otherwise.} \end{cases}
\]

For every $r \in \mathbb{N}$, the weight vector $uv^r$ evidently satisfies $\sum_J u v^r_J \le 1$ for all $u > 0$
sufficiently small (depending on $r$). The main idea of the proof is to apply Corollary
C.6 to the weight vector $v = (t/n)v^r$, then let $n \to \infty$, and finally $r \to \infty$.

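Let us begin by considering the second term in Corollary C.6. We can write

\[
\begin{aligned}
\big(I - W^{(t/n)v^r} + R^{(t/n)v^r}\big)^n
&= \Big( \Big(1 - \frac{t}{n}\Big) I + \frac{t}{n}\,\big(I - W^{v^r} + R^{v^r}\big) \Big)^n \\
&= \sum_{k=0}^{n} \binom{n}{k} \Big(\frac{t}{n}\Big)^k \Big(1 - \frac{t}{n}\Big)^{n-k} \big(I - W^{v^r} + R^{v^r}\big)^k \\
&= \mathbf{E}\big[ \big(I - W^{v^r} + R^{v^r}\big)^{Z_n} \big],
\end{aligned}
\]

where we defined the Binomial random variables $Z_n \sim \mathrm{Bin}(n, t/n)$. The random
variables $Z_n$ converge weakly as $n \to \infty$ to the Poisson random variable $Z_\infty \sim \mathrm{Pois}(t)$.
To take the limit of the above expectation, we need a simple estimate that will be
useful in the sequel.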
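The binomial identity above, and the Poisson limit it leads to, can be checked numerically on a small example. The following is only a sketch under stated assumptions: the $2\times 2$ matrix `A` is a hypothetical stand-in for $I - W^{v^r} + R^{v^r}$, and all function names are ours rather than notation from the text.

```python
import math

def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    d = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]

def identity(d):
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

def lazy_power(a, t, n):
    """((1 - t/n) I + (t/n) A)^n, computed by repeated multiplication."""
    d, u = len(a), t / n
    step = [[(1.0 - u) * (i == j) + u * a[i][j] for j in range(d)]
            for i in range(d)]
    out = identity(d)
    for _ in range(n):
        out = matmul(out, step)
    return out

def mixture(a, weights):
    """sum_k weights[k] * A^k for a finite list of weights."""
    d = len(a)
    out = [[0.0] * d for _ in range(d)]
    power = identity(d)
    for w in weights:
        for i in range(d):
            for j in range(d):
                out[i][j] += w * power[i][j]
        power = matmul(power, a)
    return out

def binomial_weights(n, t):
    """Probabilities P(Z_n = k) for Z_n ~ Bin(n, t/n), k = 0..n."""
    u = t / n
    return [math.comb(n, k) * u**k * (1.0 - u)**(n - k) for k in range(n + 1)]

def poisson_weights(t, terms=100):
    """Probabilities P(Z = k) for Z ~ Pois(t), truncated to `terms` values."""
    return [math.exp(-t) * t**k / math.factorial(k) for k in range(terms)]

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

A = [[0.5, 0.3], [0.2, 0.6]]  # hypothetical stand-in for I - W^{v^r} + R^{v^r}
t = 2.0
# Exact identity: lazy power equals the Bin(n, t/n) mixture of powers of A.
exact_gap = max_abs_diff(lazy_power(A, t, 50), mixture(A, binomial_weights(50, t)))
# Scaling limit: for large n the lazy power approaches the Pois(t) mixture.
limit_gap = max_abs_diff(lazy_power(A, t, 2000), mixture(A, poisson_weights(t)))
```

Here `exact_gap` should vanish up to rounding, while `limit_gap` shrinks as $n$ grows; this only illustrates the finite-dimensional case, whereas the proof must handle infinite matrices.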

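Lemma C.9. Let $(c_j)_{j\in I}$ be any nonnegative vector. Then

\[
\sum_{j\in I} \big(I - W^{v^r} + R^{v^r}\big)^k_{ij}\, c_j \le 2^k \max_{0\le\ell\le k} \sum_{j\in I} (I - W + R)^\ell_{ij}\, c_j
\]

for every $i \in I$ and $k \ge 0$.

Proof. As $R^v$ is nondecreasing in $v$ we obtain the elementwise estimate

\[
I - W^{v^r} + R^{v^r} \le I + R \le I + (I - W + R),
\]

where we have used $W_{ii} \le 1$. We therefore have

\[
\sum_{j\in I} \big(I - W^{v^r} + R^{v^r}\big)^k_{ij}\, c_j
\le \sum_{j\in I} \big(I + (I - W + R)\big)^k_{ij}\, c_j
= \sum_{\ell=0}^{k} \binom{k}{\ell} \sum_{j\in I} (I - W + R)^\ell_{ij}\, c_j,
\]

and the proof is easily completed. □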
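As a sanity check, the estimate of Lemma C.9 can be verified numerically for small matrices. The sketch below assumes a hypothetical diagonal `w`, a hypothetical nonnegative `r`, and models the truncated weights by a uniform rescaling `u` (so that $W^{v} = uW$ and $R^{v} = uR$); none of these values come from the text.

```python
def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    d = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]

def weighted_row_sums(m, c):
    """Vector with entries sum_j m[i][j] * c[j]."""
    return [sum(m[i][j] * c[j] for j in range(len(c))) for i in range(len(m))]

def lemma_bound_holds(m, b, c, k_max):
    """Check sum_j (M^k)_ij c_j <= 2^k max_{l<=k} sum_j (B^l)_ij c_j for k <= k_max."""
    d = len(m)
    eye = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    m_pow, b_pow = eye, eye
    best = [0.0] * d  # running maximum over l <= k of sum_j (B^l)_ij c_j
    for k in range(k_max + 1):
        lhs = weighted_row_sums(m_pow, c)
        cur = weighted_row_sums(b_pow, c)
        best = [max(x, y) for x, y in zip(best, cur)]
        if any(lhs[i] > 2**k * best[i] + 1e-12 for i in range(d)):
            return False
        m_pow, b_pow = matmul(m_pow, m), matmul(b_pow, b)
    return True

w = [0.5, 0.8]                    # hypothetical diagonal of W, entries <= 1
r = [[0.1, 0.2], [0.3, 0.1]]      # hypothetical nonnegative R
u = 0.5                           # hypothetical rescaling of the weights
d = len(w)
B = [[(i == j) * (1.0 - w[i]) + r[i][j] for j in range(d)] for i in range(d)]
M = [[(i == j) * (1.0 - u * w[i]) + u * r[i][j] for j in range(d)] for i in range(d)]
result = lemma_bound_holds(M, B, [1.0, 2.0], 12)
```

The check relies on the same elementwise domination $M \le I + B$ that drives the proof.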


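Define the random variables

\[
X_n = g(Z_n) \quad\text{with}\quad g(k) = \sum_{i,j\in I} \mathrm{osc}_i f\, \big(I - W^{v^r} + R^{v^r}\big)^k_{ij}\, (\rho\otimes\tilde\rho)\eta_j.
\]

Then $X_n \to X_\infty$ weakly by the continuous mapping theorem. On the other hand,
applying Lemma C.9 with $c_j = (\rho\otimes\tilde\rho)\eta_j$ we estimate $g(k) \le C\,2^k$ for some finite
constant $C < \infty$ and all $k \ge 0$, where we have used assumption (6.1) and that $f$ is
local. As

\[
\limsup_{u\to\infty} \sup_{n\ge 1} \mathbf{E}\big(2^{Z_n} 1_{2^{Z_n}\ge u}\big)
\le \lim_{u\to\infty} u^{-1} \sup_{n\ge 1} \mathbf{E}\,4^{Z_n}
= \lim_{u\to\infty} u^{-1} e^{3t} = 0,
\]

it follows that the random variables $(X_n)_{n\ge 1}$ are uniformly integrable. We therefore
conclude that $\mathbf{E}X_n \to \mathbf{E}X_\infty$ as $n \to \infty$ (cf. [31, Lemma 4.11]). In particular,

\[
\lim_{n\to\infty} \sum_{i,j\in I} \mathrm{osc}_i f\, \big(I - W^{(t/n)v^r} + R^{(t/n)v^r}\big)^n_{ij}\, (\rho\otimes\tilde\rho)\eta_j
= \sum_{i,j\in I} \mathrm{osc}_i f\, V^{r[t]}_{ij}\, (\rho\otimes\tilde\rho)\eta_j,
\]

where

\[
V^{r[t]} = \sum_{k=0}^{\infty} \frac{t^k e^{-t}}{k!}\, \big(I - W^{v^r} + R^{v^r}\big)^k.
\]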

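We now let $r \to \infty$. Note that $W^{v^r} \uparrow W$ and $R^{v^r} \uparrow R$ elementwise and, arguing as
in the proof of Lemma C.9, we have $I - W^{v^r} + R^{v^r} \le I + (I - W + R)$ elementwise,
where

\[
\sum_{k=0}^{\infty} \frac{t^k e^{-t}}{k!} \sum_{i,j\in I} \mathrm{osc}_i f\, \big(I + (I - W + R)\big)^k_{ij}\, (\rho\otimes\tilde\rho)\eta_j
\le e^{t} \sup_{\ell\ge 0} \sum_{i,j\in I} \mathrm{osc}_i f\, (I - W + R)^\ell_{ij}\, (\rho\otimes\tilde\rho)\eta_j
\]

is finite by assumption (6.1) and as $f$ is local. We therefore obtain

\[
\lim_{r\to\infty} \lim_{n\to\infty} \sum_{i,j\in I} \mathrm{osc}_i f\, \big(I - W^{(t/n)v^r} + R^{(t/n)v^r}\big)^n_{ij}\, (\rho\otimes\tilde\rho)\eta_j
= \sum_{i,j\in I} \mathrm{osc}_i f\, V^{[t]}_{ij}\, (\rho\otimes\tilde\rho)\eta_j
\]

by dominated convergence. That is, the second term in Corollary C.6 with $v =
(t/n)v^r$ converges as $n \to \infty$ and $r \to \infty$ to the second term in the statement of the
present result.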

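It remains to establish the corresponding conclusion for the first term in Corollary
C.6, which proceeds much along the same lines. We begin by noting that

\[
\begin{aligned}
\frac{1}{n} \sum_{k=0}^{n-1} \big(I - W^{(t/n)v^r} + R^{(t/n)v^r}\big)^k
&= \frac{1}{n} \sum_{k=0}^{n-1} \Big( \Big(1 - \frac{t}{n}\Big) I + \frac{t}{n}\,\big(I - W^{v^r} + R^{v^r}\big) \Big)^k \\
&= \frac{1}{n} \sum_{k=0}^{n-1} \sum_{\ell=0}^{k} \binom{k}{\ell} \Big(\frac{t}{n}\Big)^{\ell} \Big(1 - \frac{t}{n}\Big)^{k-\ell} \big(I - W^{v^r} + R^{v^r}\big)^{\ell} \\
&= \sum_{\ell=0}^{n-1} p^{(n)}_\ell\, \big(I - W^{v^r} + R^{v^r}\big)^{\ell},
\end{aligned}
\]

where we have defined

\[
p^{(n)}_\ell := \frac{1}{n} \sum_{k=\ell}^{n-1} \binom{k}{\ell} \Big(\frac{t}{n}\Big)^{\ell} \Big(1 - \frac{t}{n}\Big)^{k-\ell}
\]

for $\ell < n$. An elementary computation yields

\[
\sum_{\ell=0}^{n-1} p^{(n)}_\ell = 1 \quad\text{and}\quad
p^{(n)}_\ell \xrightarrow{\;n\to\infty\;} p^{(\infty)}_\ell := \frac{1}{t} \int_0^t \frac{s^\ell e^{-s}}{\ell!}\, ds.
\]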


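We can therefore introduce $\{0, 1, \ldots\}$-valued random variables $Y_n$ with $\mathbf{P}(Y_n = \ell) =
p^{(n)}_\ell$ for $\ell < n$, and we have shown above that $Y_n \to Y_\infty$ weakly and that

\[
\frac{1}{n} \sum_{k=0}^{n-1} \big(I - W^{(t/n)v^r} + R^{(t/n)v^r}\big)^k = \mathbf{E}\big[ \big(I - W^{v^r} + R^{v^r}\big)^{Y_n} \big].
\]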
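The two properties of the weights used here (they sum to one, and they converge to a time-averaged Poisson probability) are easy to confirm numerically. The sketch below assumes the weights are given by $p^{(n)}_\ell = \frac{1}{n}\sum_{k=\ell}^{n-1}\binom{k}{\ell}(t/n)^\ell(1-t/n)^{k-\ell}$, which is what the binomial expansion yields; the function names are ours. For the limit we use the standard identity $\int_0^t s^\ell e^{-s}/\ell!\,ds = \mathbf{P}(\mathrm{Pois}(t) \ge \ell+1)$.

```python
import math

def p_finite(n, ell, t):
    """p^{(n)}_ell = (1/n) * sum_{k=ell}^{n-1} C(k, ell) (t/n)^ell (1 - t/n)^(k - ell)."""
    u = t / n
    return sum(math.comb(k, ell) * u**ell * (1.0 - u)**(k - ell)
               for k in range(ell, n)) / n

def p_limit(ell, t):
    """(1/t) * int_0^t s^ell e^{-s} / ell! ds, via the Poisson tail probability."""
    tail = 1.0 - math.exp(-t) * sum(t**m / math.factorial(m) for m in range(ell + 1))
    return tail / t

t = 1.5
# The weights form a probability vector on {0, ..., n-1}.
total = sum(p_finite(40, ell, t) for ell in range(40))
# For large n the weight at a fixed ell is close to its limit.
gap = abs(p_finite(4000, 2, t) - p_limit(2, t))
```

Under these assumptions `total` equals one up to rounding and `gap` is of order $1/n$.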

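The first term in Corollary C.6 with $v = (t/n)v^r$ can be written as

\[
\sum_{i,j\in I} \mathrm{osc}_i f \sum_{k=0}^{n-1} \big(I - W^{(t/n)v^r} + R^{(t/n)v^r}\big)^k_{ij}\, a^{(t/n)v^r}_j = t\, \mathbf{E}\,h(Y_n),
\]

where we have defined

\[
h(k) = \sum_{i,j\in I} \mathrm{osc}_i f\, \big(I - W^{v^r} + R^{v^r}\big)^k_{ij}\, a^{v^r}_j.
\]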
We now proceed essentially as above. We can assume without loss of generality that

\[
\sup_{\ell\ge 0} \sum_{i,j\in I} \mathrm{osc}_i f\, (I - W + R)^\ell_{ij}\, a_j < \infty,
\]

as otherwise the right-hand side in the statement of the present result is infinite and
the estimate is trivial. It consequently follows from Lemma C.9 that $h(k) \le C\,2^k$ for
some finite constant $C < \infty$ and all $k \ge 0$. A similar computation as was done above
shows that $(h(Y_n))_{n\ge 1}$ is uniformly integrable, and therefore $\mathbf{E}\,h(Y_n) \to \mathbf{E}\,h(Y_\infty)$. In
particular, the first term in Corollary C.6 with $v = (t/n)v^r$ converges as $n \to \infty$ to

\[
\lim_{n\to\infty} \sum_{i,j\in I} \mathrm{osc}_i f \sum_{k=0}^{n-1} \big(I - W^{(t/n)v^r} + R^{(t/n)v^r}\big)^k_{ij}\, a^{(t/n)v^r}_j
= \sum_{i,j\in I} \mathrm{osc}_i f\, \tilde N^r_{ij}\, a^{v^r}_j,
\]

where

\[
\tilde N^r := \sum_{k=0}^{\infty} \int_0^t \frac{s^k e^{-s}}{k!}\, ds\; \big(I - W^{v^r} + R^{v^r}\big)^k.
\]

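Similarly, letting $r \to \infty$ and repeating exactly the arguments used above for the
second term of Corollary C.6, we obtain by dominated convergence

\[
\lim_{r\to\infty} \lim_{n\to\infty} \sum_{i,j\in I} \mathrm{osc}_i f \sum_{k=0}^{n-1} \big(I - W^{(t/n)v^r} + R^{(t/n)v^r}\big)^k_{ij}\, a^{(t/n)v^r}_j
= \sum_{i,j\in I} \mathrm{osc}_i f\, \tilde N_{ij}\, a_j,
\]

where

\[
\tilde N := \sum_{k=0}^{\infty} \int_0^t \frac{s^k e^{-s}}{k!}\, ds\; (I - W + R)^k.
\]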

To conclude, we have shown that applying Corollary C.6 to the weight vector $v =
(t/n)v^r$ and taking the limit as $n \to \infty$ and $r \to \infty$, respectively, yields the estimate

\[
|\rho f - \tilde\rho f| \le \sum_{i,j\in I} \mathrm{osc}_i f\, \tilde N_{ij}\, a_j + \sum_{i,j\in I} \mathrm{osc}_i f\, V^{[t]}_{ij}\, (\rho\otimes\tilde\rho)\eta_j.
\]

It remains to note that $t^k e^{-t}/k!$ is the density of a Gamma distribution (with shape
$k + 1$ and scale 1), so $\int_0^t s^k e^{-s}/k!\, ds \le 1$ and thus $\tilde N \le N$ elementwise. □

We  can  now  complete  the  proof  of  Theorem  6.4. 


Proof of Theorem 6.4. Once again, we will assume without loss of generality that $f$
is a local function (so that only finitely many $\mathrm{osc}_i f$ are nonzero). The extension to
quasilocal $f$ follows readily by localization as in the proof of Lemma C.3.

We begin by showing that the second term in Proposition C.8 vanishes as $t \to \infty$.
Indeed, for any $n \ge 0$, we can evidently estimate the second term as

\[
\begin{aligned}
\sum_{k=0}^{\infty} \frac{t^k e^{-t}}{k!} \sum_{i,j\in I} \mathrm{osc}_i f\, (I - W + R)^k_{ij}\, (\rho\otimes\tilde\rho)\eta_j
&\le \sup_{\ell\ge 0} \sum_{i,j\in I} \mathrm{osc}_i f\, (I - W + R)^\ell_{ij}\, (\rho\otimes\tilde\rho)\eta_j\; \sum_{k=0}^{n} \frac{t^k e^{-t}}{k!} \\
&\quad + \sup_{\ell> n} \sum_{i,j\in I} \mathrm{osc}_i f\, (I - W + R)^\ell_{ij}\, (\rho\otimes\tilde\rho)\eta_j.
\end{aligned}
\]

By assumption (6.1) and as $f$ is local, the two terms on the right vanish as $t \to \infty$
and $n \to \infty$, respectively. Thus the second term in Proposition C.8 vanishes as $t \to \infty$.
We have now proved the estimate

\[
|\rho f - \tilde\rho f| \le \sum_{i,j\in I} \mathrm{osc}_i f\, N_{ij}\, a_j.
\]

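To complete the proof of Theorem 6.4, it remains to establish the identity $N = DW^{-1}$.
This is an exercise in matrix algebra. By the definition of the matrix product, we
have

\[
(I - W + R)^p = \sum_{k=0}^{p} \sum_{\substack{n_0,\ldots,n_k \ge 0 \\ n_0 + \cdots + n_k = p-k}} (I - W)^{n_k} R \cdots (I - W)^{n_1} R\, (I - W)^{n_0}.
\]

We can therefore write

\[
\begin{aligned}
\sum_{p=0}^{\infty} (I - W + R)^p
&= \sum_{k=0}^{\infty} \sum_{n_0,\ldots,n_k\ge 0} \sum_{p=0}^{\infty} 1_{n_0+\cdots+n_k = p-k}\, 1_{k\le p}\, (I - W)^{n_k} R \cdots (I - W)^{n_1} R\, (I - W)^{n_0} \\
&= \sum_{k=0}^{\infty} \sum_{n_0,\ldots,n_k\ge 0} (I - W)^{n_k} R \cdots (I - W)^{n_1} R\, (I - W)^{n_0} \\
&= \sum_{k=0}^{\infty} (W^{-1}R)^k\, W^{-1},
\end{aligned}
\]

where we have used that $W^{-1} = \sum_{n=0}^{\infty} (I - W)^n$ as $W$ is diagonal with $0 < W_{ii} \le 1$. □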
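For finite matrices the identity $N = DW^{-1}$ can be checked directly. In the sketch below, the diagonal `w` and the matrix `r` are hypothetical example data, and the comparison with $(W - R)^{-1}$ is an extra finite-dimensional check that we add (both series are Neumann series for that inverse whenever they converge); it is not part of the proof.

```python
def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    d = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]

def power_series(a, terms):
    """Partial sum sum_{k < terms} a^k."""
    d = len(a)
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    total = [[0.0] * d for _ in range(d)]
    for _ in range(terms):
        total = [[total[i][j] + power[i][j] for j in range(d)] for i in range(d)]
        power = matmul(power, a)
    return total

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

w = [0.5, 0.8]                  # hypothetical diagonal of W, 0 < W_ii <= 1
r = [[0.1, 0.2], [0.3, 0.1]]    # hypothetical nonnegative R
d = len(w)

b = [[(i == j) * (1.0 - w[i]) + r[i][j] for j in range(d)] for i in range(d)]
winv_r = [[r[i][j] / w[i] for j in range(d)] for i in range(d)]

N = power_series(b, 400)                      # N = sum_p (I - W + R)^p
D = power_series(winv_r, 400)                 # D = sum_k (W^{-1} R)^k
D_Winv = [[D[i][j] / w[j] for j in range(d)] for i in range(d)]

# Closed form (W - R)^{-1} for the 2 x 2 case, used only as a cross-check.
det = (w[0] - r[0][0]) * (w[1] - r[1][1]) - r[0][1] * r[1][0]
closed = [[(w[1] - r[1][1]) / det, r[0][1] / det],
          [r[1][0] / det, (w[0] - r[0][0]) / det]]

gap_series = max_abs_diff(N, D_Winv)
gap_closed = max_abs_diff(N, closed)
```

Both gaps should vanish up to rounding for this example, since the spectral radii of $I - W + R$ and $W^{-1}R$ are strictly below one.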

C.4  Proof  of  Corollary  6.8 

Note that $\sup_i W_{ii} < \infty$ in all parts of Corollary 6.8 (either by assumption or as
card $I < \infty$). Moreover, it is easily seen that all parts of Corollary 6.8 as well as the
conclusion of Theorem 6.4 are unchanged if all the weights are multiplied by the same
constant. We may therefore assume without loss of generality that $\sup_i W_{ii} \le 1$.
Next, we note that as $\rho$ and $\tilde\rho$ are tempered, we have

\[
\sup_{i\in I}\, (\rho\otimes\tilde\rho)\eta_i \le \sup_{i\in I} \rho\,\eta_i(\,\cdot\,, x_i^\ast) + \sup_{i\in I} \tilde\rho\,\eta_i(x_i^\ast, \,\cdot\,) < \infty
\]

by the triangle inequality. To verify (6.1), it therefore suffices to show that

\[
\lim_{k\to\infty} \sum_{j\in I} (I - W + R)^k_{ij} = 0 \quad\text{for all } i \in I. \tag{C.1}
\]

We  now  proceed  to  verify  this  condition  in  the  different  cases  of  Corollary  6.8. 

Proof of Corollary 6.8(1). It was shown at the end of the proof of Theorem 6.4 that

\[
\sum_{k=0}^{\infty} (I - W + R)^k = \sum_{k=0}^{\infty} (W^{-1}R)^k\, W^{-1} = DW^{-1}.
\]

As $W^{-1}$ has finite entries, $D < \infty$ certainly implies that $(I - W + R)^k \to 0$ as $k \to \infty$
elementwise. But this trivially yields (C.1) when card $I < \infty$. □

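Proof of Corollary 6.8(2). Note that we can write

\[
D = \sum_{k=0}^{\infty} (W^{-1}R)^k = \sum_{p=0}^{n-1} (W^{-1}R)^p \sum_{k=0}^{\infty} (W^{-1}R)^{nk}.
\]

Therefore, if $R < \infty$ and $\|(W^{-1}R)^n\| < 1$, we can estimate

\[
\|D\| \le \sum_{p=0}^{n-1} \|(W^{-1}R)^p\| \sum_{k=0}^{\infty} \|(W^{-1}R)^n\|^k < \infty.
\]

Thus $D < \infty$ and we conclude by the previous part. □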
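The point of this part is that only some power of $W^{-1}R$ needs to be a contraction. A small numerical illustration, in which the matrix `A` is a hypothetical stand-in for $W^{-1}R$ chosen so that $\|A\|_\infty = 2 > 1$ but $\|A^2\|_\infty = 0.2 < 1$, might look as follows (function names are ours):

```python
def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    d = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]

def identity(d):
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

def norm_inf(a):
    """Maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in a)

def power_series(a, terms):
    """Partial sum sum_{k < terms} a^k."""
    d = len(a)
    power, total = identity(d), [[0.0] * d for _ in range(d)]
    for _ in range(terms):
        total = [[total[i][j] + power[i][j] for j in range(d)] for i in range(d)]
        power = matmul(power, a)
    return total

A = [[0.0, 2.0], [0.1, 0.0]]   # hypothetical W^{-1}R: not a contraction itself
n = 2                          # but its n-th power is

powers = [identity(2)]
for _ in range(n):
    powers.append(matmul(powers[-1], A))

contraction = norm_inf(powers[n])                    # ||A^n||_inf
bound = sum(norm_inf(p) for p in powers[:n]) / (1.0 - contraction)
D = power_series(A, 200)                             # D = sum_k A^k
```

For this example the estimate $\|D\| \le \sum_{p<n}\|A^p\| \sum_k \|A^n\|^k$ is attained: both sides are 3.75 up to rounding.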

Proof of Corollary 6.8(3). We give a simple probabilistic proof (a more complicated
matrix-analytic proof could be given along the lines of [14, Theorem 3.21]). Let
$P = W^{-1}R$. As $\|P\|_\infty < 1$, the infinite matrix $P$ is substochastic. Thus $P$ is the
transition probability matrix of a killed Markov chain $(X_n)_{n\ge 0}$ such that $\mathbf{P}(X_n =
j \mid X_{n-1} = i) = P_{ij}$ and $\mathbf{P}(X_n \text{ is dead} \mid X_{n-1} = i) = 1 - \sum_j P_{ij}$ (once the chain dies, it
stays dead). Denote by $\zeta = \inf\{n : X_n \text{ is dead}\}$ the killing time of the chain. Then
we obtain

\[
\mathbf{P}(\zeta > n \mid X_0 = i) = \mathbf{P}(X_n \text{ is not dead} \mid X_0 = i) = \sum_{j\in I} P^n_{ij} \le \|P^n\|_\infty \le \|P\|^n_\infty.
\]

Therefore, as $\|P\|_\infty < 1$, we find by letting $n \to \infty$ that $\mathbf{P}(\zeta = \infty \mid X_0 = i) = 0$. That
is, the chain dies eventually with unit probability for any initial condition.
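The survival bound for the killed chain can be illustrated on a finite state space. In the sketch below the substochastic matrix `P` is hypothetical example data; the survival probability after $n$ steps from state $i$ is the $i$th row sum of $P^n$, which is compared with $\|P\|_\infty^n$.

```python
def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    d = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]

def survival_probabilities(p, n_max):
    """List over n = 0..n_max of the vectors (P(zeta > n | X_0 = i))_i = row sums of P^n."""
    d = len(p)
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    out = []
    for _ in range(n_max + 1):
        out.append([sum(row) for row in power])
        power = matmul(power, p)
    return out

P = [[0.2, 0.4], [0.375, 0.125]]          # hypothetical substochastic matrix
norm = max(sum(row) for row in P)         # ||P||_inf, here about 0.6
surv = survival_probabilities(P, 60)

# Geometric bound P(zeta > n | X_0 = i) <= ||P||_inf^n for every n and i.
bound_ok = all(s <= norm**n + 1e-12 for n, row in enumerate(surv) for s in row)
```

In particular the survival probabilities decay to zero geometrically, so the chain dies almost surely from any initial state.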
Now define $\tilde P = I - W + R = I - W + WP$. As $\sup_i W_{ii} \le 1$, the matrix $\tilde P$ is also
substochastic and corresponds to the following transition mechanism. If $\tilde X_{n-1} = i$,
then at time $n$ we flip a biased coin that comes up heads with probability $W_{ii}$. In case
of heads we make a transition according to the matrix $P$, but in case of tails we leave
the current state unchanged. From this description, it is evident that we can construct
a Markov chain $(\tilde X_n)_{n\ge 0}$ with transition matrix $\tilde P$ by modifying the chain $(X_n)_{n\ge 0}$ as
follows. Conditionally on $(X_n)_{n\ge 0}$, draw independent random variables $(\xi_n)_{n\ge 0}$ such
that $\xi_n$ is geometrically distributed with parameter $W_{X_n X_n}$. Now define the process
$(\tilde X_n)_{n\ge 0}$ such that it stays in state $X_0$ for the first $\xi_0$ time steps, then is in state $X_1$
for the next $\xi_1$ time steps, etc. By construction, the resulting process is Markov with
transition matrix $\tilde P$. Moreover, as $\zeta < \infty$ a.s., we have $\tilde\zeta := \inf\{n : \tilde X_n \text{ is dead}\} < \infty$
a.s. also. Thus

\[
\lim_{n\to\infty} \sum_{j\in I} (I - W + R)^n_{ij} = \lim_{n\to\infty} \mathbf{P}(\tilde\zeta > n \mid \tilde X_0 = i) = 0
\]

for every $i \in I$. We have therefore established (C.1). □

Proof of Corollary 6.8(4). We begin by writing as above


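\[
\sum_{k=0}^{\infty} (I - W + R)^k = \sum_{k=0}^{\infty} (W^{-1}R)^k\, W^{-1} = \sum_{k=0}^{\infty} W^{-1} (RW^{-1})^k,
\]

where the last identity is straightforward. Arguing as in Corollary 6.8(2), we obtain

\[
\sum_{k=0}^{\infty} \sum_{j\in I} (I - W + R)^k_{ij}
= W_{ii}^{-1} \sum_{k=0}^{\infty} \sum_{j\in I} \big((RW^{-1})^k\big)_{ij}
\le W_{ii}^{-1} \sum_{p=0}^{n-1} \|(RW^{-1})^p\|_\infty \sum_{k=0}^{\infty} \|(RW^{-1})^n\|^k_\infty < \infty.
\]

It follows immediately that (C.1) holds. □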

Proof  of  Corollary  6.8(5).  Note  that 

jei  jei  jei 

Thus $\sum_j \|\eta_j\| < \infty$ and $\|RW^{-1}\|_1 < 1$ yield


OO  OO 

fc=o  jei  k= 0  jei 

which evidently implies

\[
\lim_{k\to\infty} \sum_{j\in I} (I - W + R)^k_{ij}\, (\rho\otimes\tilde\rho)\eta_j = 0 \quad\text{for all } i \in I.
\]

We have therefore established (6.1). □
Proof of Corollary 6.8(6). Let $r = \sup\{m(i,j) : R_{ij} > 0\}$ (which is finite by assumption),
and choose $\beta > 0$ such that $\|RW^{-1}\|_1 < e^{-\beta r}$. Then we can estimate

\[
\|RW^{-1}\|_{1,\beta m} := \sup_{j\in I} \sum_{i\in I} e^{\beta m(i,j)} (RW^{-1})_{ij} \le e^{\beta r} \|RW^{-1}\|_1 < 1.
\]

As $m$ is a pseudometric, it satisfies the triangle inequality and it is therefore easily
seen that $\|\cdot\|_{1,\beta m}$ is a matrix norm. In particular, we can estimate

\[
e^{\beta m(i,j)} \big((RW^{-1})^n\big)_{ij} \le \|(RW^{-1})^n\|_{1,\beta m} \le \|RW^{-1}\|^n_{1,\beta m}
\]

for every $i, j \in I$. But then

\[
\|(RW^{-1})^n\|_\infty = \sup_{i\in I} \sum_{j\in I} \big((RW^{-1})^n\big)_{ij} \le \|RW^{-1}\|^n_{1,\beta m}\, \sup_{i\in I} \sum_{j\in I} e^{-\beta m(i,j)} < \infty
\]

for all $n$. We therefore have $\|RW^{-1}\|_\infty < \infty$, and we can choose $n$ sufficiently large
that $\|(RW^{-1})^n\|_\infty < 1$. The conclusion now follows from Corollary 6.8(4). □

C.5  Proof  of  Theorem  6.12 

In the case of one-sided local updates, the measure $\rho_{\le k}$ is $\gamma^J$-invariant for $\tau(J) = k$
(but not for $\tau(J) < k$). The proof of Theorem 6.12 therefore proceeds by induction on
$k$. In each stage of the induction, we apply the logic of Theorem 6.4 to the partial local
updates $(\gamma^J)_{J\in\mathcal{J}:\,\tau(J)=k}$, and use the induction hypothesis to estimate the remainder
term.

Throughout this section, we work in the setting of Theorem 6.12. Define

\[
I_{\le k} := \{i \in I : \tau(i) \le k\}, \qquad I_k := \{i \in I : \tau(i) = k\}.
\]

Note that we can assume without loss of generality that $R_{ij} = 0$ whenever $\tau(j) > \tau(i)$.
Indeed, the local update rule $\gamma^J_x$ does not depend on $x_j$ for $\tau(j) > \tau(J)$, so we can
trivially choose the coupling $Q^J_{x,z}$ for $x^{I\setminus\{j\}} = z^{I\setminus\{j\}}$ such that $Q^J_{x,z}\eta_i = 0$ for all $i \in J$.
On the other hand, the choice $R_{ij} = 0$ evidently yields the smallest bound in Theorem
6.12. In the sequel, we will always assume that $R_{ij} = 0$ whenever $\tau(j) > \tau(i)$.

The  key  induction  step  is  formalized  by  the  following  result. 

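Proposition C.10. Assume (6.1). Let $(\beta_i)_{i\in I_{\le k-1}}$ be nonnegative weights such that

\[
|\rho_{\le k-1} g - \tilde\rho_{\le k-1} g| \le \sum_{i\in I_{\le k-1}} \mathrm{osc}_i g\; \beta_i
\]

for every bounded measurable quasilocal function $g$ on $\mathbb{S}_{\le k-1}$ so that $\mathrm{osc}_i g < \infty\ \forall i$.
Then

\[
|\rho_{\le k} f - \tilde\rho_{\le k} f| \le \sum_{j\in I_{\le k-1}} \Big\{ \mathrm{osc}_j f + \sum_{i,l\in I_k} \mathrm{osc}_i f\, D_{il}\, (W^{-1}R)_{lj} \Big\}\, \beta_j + \sum_{i,j\in I_k} \mathrm{osc}_i f\, D_{ij}\, W_{jj}^{-1} a_j
\]

for every bounded measurable quasilocal function $f$ on $\mathbb{S}_{\le k}$ so that $\mathrm{osc}_i f < \infty\ \forall i$.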
Proof. We fix throughout the proof a bounded and measurable local function $f :
\mathbb{S}_{\le k} \to \mathbb{R}$ such that $\mathrm{osc}_i f < \infty$ for all $i \in I_{\le k}$. The extension of the conclusion to
quasilocal functions $f$ follows readily by localization as in the proof of Lemma C.3.

We denote by $G^v$ and $\tilde G^v$ the Gibbs samplers as defined in Section C.2. Let
us enumerate the partial cover $\{J \in \mathcal{J} : \tau(J) = k\}$ as $\{J_1, J_2, \ldots\}$, and define the
weights $v^r$ as in the proof of Proposition C.8. By the definition of the one-sided local
update rule, $\rho_{\le k}$ is $G^{uv^r}$-invariant and $\tilde\rho_{\le k}$ is $\tilde G^{uv^r}$-invariant for every $r, u$ such that
$\sum_J u v^r_J \le 1$. Thus

\[
|\rho_{\le k} f - \tilde\rho_{\le k} f| \le \sum_{i,j\in I} \mathrm{osc}_i f\, N^{uv^r(n)}_{ij}\, a^{uv^r}_j + \big|\rho_{\le k} (G^{uv^r})^n f - \tilde\rho_{\le k} (G^{uv^r})^n f\big|
\]

as in the proof of Corollary C.6, with the only distinction that we refrain from using
the Wasserstein matrix to expand the second term in the proof of Proposition C.2.
We now use the induction hypothesis to obtain an improved estimate for the second
term.

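Lemma C.11. We can estimate

\[
|\rho_{\le k} g - \tilde\rho_{\le k} g| \le \sum_{i\in I_{\le k-1}} \mathrm{osc}_i g\; \beta_i + 3 \sum_{i\in I_k} \mathrm{osc}_i g\; (\rho\otimes\tilde\rho)\eta_i
\]

for any bounded and measurable quasilocal function $g : \mathbb{S}_{\le k} \to \mathbb{R}$ such that $\mathrm{osc}_i g < \infty\ \forall i$.

Proof. For any $x \in \mathbb{S}_{\le k}$ we can estimate

\[
|\rho_{\le k} g - \tilde\rho_{\le k} g| \le |\rho_{\le k-1} g_x - \tilde\rho_{\le k-1} g_x| + |\rho_{\le k}(g - g_x)| + |\tilde\rho_{\le k}(g - g_x)|,
\]

where we defined $g_x(z) := g(z^{I_{\le k-1}} x^{I_k})$. By Lemma C.3 we have

\[
|g(z) - g_x(z)| \le \sum_{i\in I_k} \mathrm{osc}_i g\; \eta_i(z_i, x_i).
\]

We can therefore estimate using the induction hypothesis and the triangle inequality

\[
|\rho_{\le k} g - \tilde\rho_{\le k} g| \le \sum_{i\in I_{\le k-1}} \mathrm{osc}_i g\; \beta_i + \sum_{i\in I_k} \mathrm{osc}_i g\, \big\{ \rho\,\eta_i(\,\cdot\,, \tilde x_i) + \eta_i(x_i, \tilde x_i) + \tilde\rho\,\eta_i(\,\cdot\,, x_i) \big\}
\]

for all $x, \tilde x \in \mathbb{S}_{\le k}$. Now integrate this expression with respect to $\rho(dx)\,\tilde\rho(d\tilde x)$. □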

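To lighten the notation somewhat we will write $v = uv^r$ until further notice. Note
that by construction $a^v_j = 0$ whenever $\tau(j) < k$, while $R^v_{ij} = 0$ whenever $\tau(j) > \tau(i)$
by assumption. Thus we obtain using Lemma C.11 and Lemma C.5

\[
\begin{aligned}
|\rho_{\le k} f - \tilde\rho_{\le k} f| &\le \sum_{i,j\in I_k} \mathrm{osc}_i f\, N^{v(n)}_{ij}\, a^v_j + 3 \sum_{i,j\in I_k} \mathrm{osc}_i f\, (I - W^v + R^v)^n_{ij}\, (\rho\otimes\tilde\rho)\eta_j \\
&\quad + \sum_{i\in I_{\le k}} \sum_{j\in I_{\le k-1}} \mathrm{osc}_i f\, (I - W^v + R^v)^n_{ij}\, \beta_j,
\end{aligned}
\]

provided that $\sum_i \mathrm{osc}_i f\, (I - W^v + R^v)^n_{ij} < \infty$ for all $j$.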

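Next, note that as $v_J = 0$ for $\tau(J) < k$, we have $R^v_{ij} = W^v_{ij} = 0$ for $i \in I_{\le k-1}$.
Thus, in block form with respect to the decomposition $I_{\le k} = I_{\le k-1} \cup I_k$,

\[
V^v = I - W^v + R^v = \begin{pmatrix} I & 0 \\ \check R^v & \check V^v \end{pmatrix},
\]

where $\check V^v := (V^v_{ij})_{i,j\in I_k}$ and $\check R^v := (R^v_{ij})_{i\in I_k,\, j\in I_{\le k-1}}$. In particular,

\[
(I - W^v + R^v)^n = \begin{pmatrix} I & 0 \\ \sum_{m=0}^{n-1} (\check V^v)^m \check R^v & (\check V^v)^n \end{pmatrix}.
\]

Moreover, as $R^v_{ij} = 0$ whenever $\tau(j) > \tau(i)$, we evidently have $(\check V^v)^m_{ij} = (V^v)^m_{ij}$ for
$i, j \in I_k$. Substituting into the above expression, we obtain

\[
\begin{aligned}
|\rho_{\le k} f - \tilde\rho_{\le k} f| &\le \sum_{i,j\in I_k} \mathrm{osc}_i f\, N^{v(n)}_{ij}\, a^v_j + 3 \sum_{i,j\in I_k} \mathrm{osc}_i f\, (I - W^v + R^v)^n_{ij}\, (\rho\otimes\tilde\rho)\eta_j \\
&\quad + \sum_{j\in I_{\le k-1}} \Big\{ \mathrm{osc}_j f + \sum_{i,l\in I_k} \mathrm{osc}_i f\, N^{v(n)}_{il}\, R^v_{lj} \Big\}\, \beta_j
\end{aligned}
\]

provided that $\sum_i \mathrm{osc}_i f\, (I - W^v + R^v)^n_{ij} < \infty$ for all $j$. But the latter is easily verified
using (6.1) and Lemma C.9, as $f$ is local and $\mathrm{osc}_i f < \infty$ for all $i$ by assumption.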

The  remainder  of  the  proof  now  proceeds  precisely  as  in  the  proof  of  Proposition 
C.8 and Theorem 6.4. We set $v = (t/n)v^r$, let $n \to \infty$ and then $r \to \infty$. The
arguments  for  the  first  two  terms  are  identical  to  the  proof  of  Proposition  C.8,  while 
the  argument  for  the  third  term  is  essentially  identical  to  the  argument  for  the  first 
term.  The  proof  is  then  completed  as  in  the  proof  of  Theorem  6.4.  We  leave  the 
details  for  the  reader.  □ 

We  now  proceed  to  complete  the  proof  of  Theorem  6.12. 

Proof of Theorem 6.12. Consider first the case that $k_- := \inf_{i\in I} \tau(i) > -\infty$. In this
setting, we say that the comparison theorem holds for a given $k \ge k_-$ if we have

\[
|\rho_{\le k} f - \tilde\rho_{\le k} f| \le \sum_{i,j\in I_{\le k}} \mathrm{osc}_i f\, D_{ij}\, W_{jj}^{-1} a_j
\]

for every bounded measurable quasilocal function $f$ on $\mathbb{S}_{\le k}$ such that $\mathrm{osc}_i f < \infty\ \forall i$.
We can evidently apply Theorem 6.4 to show that the comparison theorem holds for
$k_-$. We will now use Proposition C.10 to show that if the comparison theorem holds
for $k - 1$, then it holds for $k$ also. Then the comparison theorem holds for every
$k \ge k_-$ by induction, so the conclusion of Theorem 6.12 holds whenever $f$ is a local
function. The extension to quasilocal $f$ follows readily by localization as in the proof
of Lemma C.3.

We now complete the induction step. When the comparison theorem holds for
$k - 1$ (the induction hypothesis), we can apply Proposition C.10 with

\[
\beta_i = \sum_{j\in I_{\le k-1}} D_{ij}\, W_{jj}^{-1} a_j.
\]
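This gives

\[
\begin{aligned}
|\rho_{\le k} f - \tilde\rho_{\le k} f| &\le \sum_{i,l\in I_k} \sum_{j,q\in I_{\le k-1}} \mathrm{osc}_i f\, D_{il}\, (W^{-1}R)_{lq}\, D_{qj}\, W_{jj}^{-1} a_j \\
&\quad + \sum_{i,j\in I_k} \mathrm{osc}_i f\, D_{ij}\, W_{jj}^{-1} a_j + \sum_{i,j\in I_{\le k-1}} \mathrm{osc}_i f\, D_{ij}\, W_{jj}^{-1} a_j
\end{aligned}
\]

for every bounded measurable quasilocal function $f$ on $\mathbb{S}_{\le k}$ so that $\mathrm{osc}_i f < \infty\ \forall i$. To
complete the proof, it therefore suffices to show that we have

\[
D_{ij} = \sum_{q\in I_{\le k-1}} \sum_{l\in I_k} D_{il}\, (W^{-1}R)_{lq}\, D_{qj} \quad\text{for } i \in I_k,\ j \in I_{\le k-1}.
\]

To see this, note that as $R_{ij} = 0$ for $\tau(i) < \tau(j)$, we can write

\[
\begin{aligned}
D_{ij} &= \sum_{p=1}^{\infty} \sum_{\substack{j_1,\ldots,j_{p-1}\in I: \\ \tau(j)\le\tau(j_1)\le\cdots\le\tau(j_{p-1})\le k}} (W^{-1}R)_{i j_{p-1}} \cdots (W^{-1}R)_{j_2 j_1} (W^{-1}R)_{j_1 j} \\
&= \sum_{p=1}^{\infty} \sum_{n=1}^{p} \sum_{l\in I_k} \sum_{q\in I_{\le k-1}} \big((W^{-1}R)^{n-1}\big)_{il}\, (W^{-1}R)_{lq}\, \big((W^{-1}R)^{p-n}\big)_{qj}
\end{aligned}
\]

for $i \in I_k$ and $j \in I_{\le k-1}$, where we have used that whenever $\tau(j_1) \le \cdots \le \tau(j_{p-1}) \le k$
there exists $1 \le n \le p$ such that $j_1, \ldots, j_{p-n} \in I_{\le k-1}$ and $j_{p-n+1}, \ldots, j_{p-1} \in I_k$.
Rearranging the last expression yields the desired identity for $D_{ij}$, completing the
proof for the case $k_- > -\infty$ (note that in this case the additional assumption (6.2)
was not needed).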

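We now turn to the case that $k_- = -\infty$. Let us say that $(\beta_i)_{i\in I_{\le k}}$ is a $k$-estimate
if

\[
|\rho_{\le k} g - \tilde\rho_{\le k} g| \le \sum_{i\in I_{\le k}} \mathrm{osc}_i g\; \beta_i
\]

for every bounded measurable quasilocal function $g$ on $\mathbb{S}_{\le k}$ such that $\mathrm{osc}_i g < \infty\ \forall i$.
Then the conclusion of Proposition C.10 can be reformulated as follows: if $(\beta_i)_{i\in I_{\le k-1}}$
is a $(k-1)$-estimate, then $(\beta'_i)_{i\in I_{\le k}}$ is a $k$-estimate with $\beta'_i = \beta_i$ for $i \in I_{\le k-1}$ and

\[
\beta'_i = \sum_{j\in I_{\le k-1}} \sum_{l\in I_k} D_{il}\, (W^{-1}R)_{lj}\, \beta_j + \sum_{j\in I_k} D_{ij}\, W_{jj}^{-1} a_j
\]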

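for $i \in I_k$. We can therefore repeatedly apply Proposition C.10 to extend an initial
estimate. In particular, if we fix $k \in \mathbb{Z}$ and $n \ge 1$, and if $(\beta_i)_{i\in I_{\le k-n}}$ is a $(k-n)$-estimate,
then we can obtain a $k$-estimate $(\beta'_i)_{i\in I_{\le k}}$ by iterating Proposition C.10 $n$
times. We claim that

\[
\beta'_i = \sum_{s=k-n+1}^{k-r} \Big\{ \sum_{j\in I_{\le k-n}} \sum_{l\in I_s} D_{il}\, (W^{-1}R)_{lj}\, \beta_j + \sum_{j\in I_s} D_{ij}\, W_{jj}^{-1} a_j \Big\}
\]

for $0 \le r \le n - 1$ and $i \in I_{k-r}$. To see this, we proceed again by induction. As
$(\beta_i)_{i\in I_{\le k-n}}$ is a $(k-n)$-estimate, the expression is valid for $r = n - 1$ by Proposition
C.10. Now suppose the expression is valid for all $u < r \le n - 1$. Then we obtain

\[
\begin{aligned}
\beta'_i &= \sum_{j\in I_{\le k-n}} \sum_{l\in I_{k-u}} D_{il}\, (W^{-1}R)_{lj}\, \beta_j + \sum_{j\in I_{k-u}} D_{ij}\, W_{jj}^{-1} a_j \\
&\quad + \sum_{s=k-n+1}^{k-u-1} \sum_{j\in I_s} \sum_{l\in I_{k-u}} \sum_{t=k-n+1}^{s} \sum_{q\in I_{\le k-n}} \sum_{p\in I_t} D_{il}\, (W^{-1}R)_{lj}\, D_{jp}\, (W^{-1}R)_{pq}\, \beta_q \\
&\quad + \sum_{s=k-n+1}^{k-u-1} \sum_{j\in I_s} \sum_{l\in I_{k-u}} \sum_{t=k-n+1}^{s} \sum_{q\in I_t} D_{il}\, (W^{-1}R)_{lj}\, D_{jq}\, W_{qq}^{-1} a_q
\end{aligned}
\]

for $i \in I_{k-u}$ by Proposition C.10. Rearranging the sums yields

\[
\begin{aligned}
\beta'_i &= \sum_{j\in I_{\le k-n}} \sum_{l\in I_{k-u}} D_{il}\, (W^{-1}R)_{lj}\, \beta_j + \sum_{j\in I_{k-u}} D_{ij}\, W_{jj}^{-1} a_j \\
&\quad + \sum_{t=k-n+1}^{k-u-1} \Big\{ \sum_{q\in I_{\le k-n}} \sum_{p\in I_t} \tilde D_{ip}\, (W^{-1}R)_{pq}\, \beta_q + \sum_{p\in I_t} \tilde D_{ip}\, W_{pp}^{-1} a_p \Big\}
\end{aligned}
\]

for $i \in I_{k-u}$, where we have defined

\[
\tilde D_{ij} := \sum_{\ell=s}^{t-1} \sum_{q\in I_\ell} \sum_{l\in I_t} D_{il}\, (W^{-1}R)_{lq}\, D_{qj}
\]

whenever $i \in I_t$ and $j \in I_s$ for $s < t$. But as $D_{qj} = 0$ when $\tau(q) < \tau(j)$, we have

\[
\tilde D_{ij} = \sum_{q\in I_{\le t-1}} \sum_{l\in I_t} D_{il}\, (W^{-1}R)_{lq}\, D_{qj} = D_{ij} \quad\text{for } i \in I_t,\ j \in I_{\le t-1}
\]

using the identity used in the proof for the case $k_- > -\infty$, and the claim follows.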

We can now complete the proof for the case $k_- = -\infty$. It suffices to prove the
theorem for a given local function $f$ (the extension to quasilocal $f$ follows readily
as in the proof of Lemma C.3). Let us therefore fix a $K$-local function $f$ for some
$K \in \mathcal{I}$, and let $k = \max_{i\in K} \tau(i)$ and $n \ge 1$. By Lemma C.3, we find that $(\beta_i)_{i\in I_{\le k-n}}$ is
trivially a $(k-n)$-estimate if we set $\beta_i = (\rho\otimes\tilde\rho)\eta_i$ for $i \in I_{\le k-n}$. We therefore obtain

\[
|\rho f - \tilde\rho f| \le \sum_{i,j\in I} \mathrm{osc}_i f\, D_{ij}\, W_{jj}^{-1} a_j + \sum_{i\in I} \sum_{j\in I_{\le k-n}} \mathrm{osc}_i f\, D_{ij}\, (\rho\otimes\tilde\rho)\eta_j
\]

from the $k$-estimate $(\beta'_i)_{i\in I_{\le k}}$ derived above, where we have used that $DW^{-1}R \le D$.
But as $f$ is local and $\mathrm{osc}_i f < \infty$ for all $i$ by assumption, the second term vanishes as
$n \to \infty$ by assumption (6.2). This completes the proof for the case $k_- = -\infty$. □




C.6  Block  particle  filter,  improved  analysis 


In the remainder of this appendix we provide the proof of Theorem 6.13. We
work in the same setting introduced in Chapter 4. Recall the following three
recursions:

\[
\pi_n^\mu := F_n \cdots F_1 \mu, \qquad
\tilde\pi_n^\mu := \tilde F_n \cdots \tilde F_1 \mu, \qquad
\hat\pi_n^\mu := \hat F_n \cdots \hat F_1 \mu,
\]

where $F_n := C_n P$, $\tilde F_n := C_n B P$, and $\hat F_n := C_n B S^N P$. This allows us to decompose the
approximation error into two terms, one due to localization and one due to sampling,

\[
|||\pi_n^\mu - \hat\pi_n^\mu|||_J \le \underbrace{|||\pi_n^\mu - \tilde\pi_n^\mu|||_J}_{\text{bias}} + \underbrace{|||\tilde\pi_n^\mu - \hat\pi_n^\mu|||_J}_{\text{variance}}
\]

by the triangle inequality (see Section 4.5.1). In the proof of Theorem 6.13, each of
the terms on the right will be considered separately. The first term, which quantifies
the bias due to the localization, will be bounded in Section C.6.1. The second term,
which quantifies the sampling variance, will be bounded in Section C.6.2. Combining
these two bounds completes the proof.


C.6.1  Bounding  the  bias 

The goal of this section is to bound the bias term $\|\pi_n^\sigma - \tilde\pi_n^\sigma\|_J$, where we recall the
definition

\[
\|\mu - \nu\|_J := \sup_{f\in\mathcal{X}^J:\, |f|\le 1} |\mu f - \nu f|
\]

of the local total variation distance on the set of sites $J$. [Note that $\|\mu - \nu\|_J \le K$ for
some $K \in \mathbb{R}$ evidently implies $|||\mu - \nu|||_J \le K$; the random measure norm $|||\cdot|||_J$ will
be essential to bound the sampling error, but is irrelevant for the bias term.]

Let us first give an informal outline of the ideas behind the proof of the bias
bound. While the filter $\pi_n^\sigma$ is itself a high-dimensional distribution (defined on the
set of sites $V$), we do not know how to obtain a tractable local update rule for it. We
therefore cannot apply Theorem 6.4 directly. Instead, we will consider the smoothing
distribution

\[
\rho = \mathbf{P}^\sigma(X_1, \ldots, X_n \in \,\cdot\, \mid Y_1, \ldots, Y_n),
\]

defined on the extended set of sites $I = \{1, \ldots, n\} \times V$ and configuration space
$\mathbb{S} = \mathbb{X}^n$. As $(X_k^v, Y_k^v)_{(k,v)\in I}$ is a Markov random field (cf. Figure 4.1), we can read
off a local update rule for $\rho$ from the model definition. At the same time, as $\pi_n^\sigma =
\mathbf{P}^\sigma(X_n \in \,\cdot\, \mid Y_1, \ldots, Y_n)$ is a marginal of $\rho$, we immediately obtain estimates for $\pi_n^\sigma$
from estimates for $\rho$.

This basic idea relies on the probabilistic definition of the filter as a conditional
distribution of a Markov random field: the filtering recursion (which was only
introduced for computational purposes) plays no role in the analysis. The block filter
$\tilde\pi_n^\sigma$, on the other hand, is defined in terms of a recursion and does not have an
intrinsic probabilistic interpretation. In order to handle the block filter, we will
artificially cook up a probability measure $\tilde{\mathbf{P}}$ on $\mathbb{S}$ such that the block filter satisfies
$\tilde\pi_n^\sigma = \tilde{\mathbf{P}}(X_n \in \,\cdot\, \mid Y_1, \ldots, Y_n)$, and set

\[
\tilde\rho = \tilde{\mathbf{P}}(X_1, \ldots, X_n \in \,\cdot\, \mid Y_1, \ldots, Y_n).
\]

This implies in particular that

\[
\|\pi_n^\sigma - \tilde\pi_n^\sigma\|_J = \|\rho - \tilde\rho\|_{\{n\}\times J},
\]

and we can now bound the bias term by applying Theorem 6.4.

To apply the comparison theorem we must choose a good cover $\mathcal{J}$. It is here that
the full flexibility of Theorem 6.4, as opposed to the classical comparison theorem,
comes into play. If we were to apply Theorem 6.4 with the singleton cover $\mathcal{J}_s = \{\{i\} :
i \in I\}$, we would recover the result of Theorem 4.2: in this case both the spatial and
temporal interactions must be weak in order to ensure that $D = \sum_n (W^{-1}R)^n < \infty$.
To avoid this problem, we work instead with larger blocks in the temporal direction.
That is, our blocks $J \in \mathcal{J}$ will have the form $J = \{k + 1, \ldots, k + q\} \times \{v\}$ for an
appropriate choice of the block length $q$. The local update $\gamma^J_x$ now behaves as $q$ time
steps of an ergodic Markov chain in $\mathbb{X}^v$: the temporal interactions decay geometrically
with $q$, and can therefore be made arbitrarily small even if the interaction in one time
step is arbitrarily strong. On the other hand, when we increase $q$ there will be more
nonzero terms in the matrix $W^{-1}R$. We must therefore ultimately tune the block
length $q$ appropriately to obtain the result of Theorem 6.13.

Remark C.12. The approach used here to bound the bias directly using the comparison
theorem is different from the one used in Chapter 4, which exploits the recursive
property of the filter. The latter approach has a broader scope, as it does not rely on
the ability to express the approximate filter as the marginal of a random field as we do
above: this could be essential for the analysis of more sophisticated algorithms that do
not admit such a representation. For the purposes of the current analysis, however,
the present approach provides an alternative and somewhat shorter proof that is well
adapted to the analysis of block particle filters.

Remark C.13. The problem under investigation is based on an interacting Markov
chain model, and is therefore certainly dynamical in nature. Nonetheless, our proofs
use Theorem 6.4 and not the one-sided Theorem 6.12. If we were to approximate
the dynamics of the Markov chain $X_n$ itself, it would be much more convenient to
apply Theorem 6.12 as the model is already defined in terms of one-sided conditional
distributions $p(x, z)\,\psi(dz)$. Unfortunately, when we condition on the observations
$Y_n$, the one-sided conditional distributions take a complicated form that incorporates
all the information in the future observations, whereas conditioning on all variables
outside a block $J \in \mathcal{J}$ gives rise to relatively tractable expressions. For this reason, the
static "space-time" picture remains the most convenient approach for the investigation
of high-dimensional filtering problems.

We  now  turn  to  the  details  of  the  proof.  We  first  state  the  main  result  of  this 
section. 
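Theorem C.14 (Bias term). Suppose there exist $0 < \varepsilon, \delta < 1$ such that

\[
\varepsilon\, q^v(x^v, z^v) \le p^v(x, z^v) \le \varepsilon^{-1} q^v(x^v, z^v), \qquad
\delta \le q^v(x^v, z^v) \le \delta^{-1}
\]

for every $v \in V$ and $x, z \in \mathbb{X}$, where $q^v : \mathbb{X}^v \times \mathbb{X}^v \to \mathbb{R}_+$ is a transition density with
respect to $\psi^v$. Suppose also that we can choose $q \in \mathbb{N}$ and $\beta > 0$ such that

\[
c := 3q\Delta^2 e^{\beta(q+2r)} \bigl(1 - \varepsilon^{2(\Delta+1)}\bigr) + e^{\beta}(1 - \varepsilon^2\delta^2) + e^{\beta q}(1 - \varepsilon^2\delta^2)^q < 1.
\]

Then we have

\[
\|\pi_n^\sigma - \tilde\pi_n^\sigma\|_J \le \frac{2e^{\beta r}}{1-c}\,\bigl(1 - \varepsilon^{2(q+1)\Delta}\bigr)\, \mathrm{card}\,J \; e^{-\beta d(J,\partial K)}
\]

for every $n \ge 0$, $\sigma \in \mathbb{X}$, $K \in \mathcal{K}$ and $J \subseteq K$.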
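
To get a feel for the trade-off in the block length $q$ described above, the constant $c$ of Theorem C.14 can be evaluated numerically. The following is a minimal sketch, not part of the original analysis: the function names and the parameter values used in the usage note are illustrative assumptions, and the formulas are transcribed from the statement of the theorem.

```python
import math


def bias_constant(eps, delta, Delta, r, q, beta):
    """Constant c of Theorem C.14 for given model and tuning parameters."""
    spatial = (3 * q * Delta**2 * math.exp(beta * (q + 2 * r))
               * (1 - eps ** (2 * (Delta + 1))))
    a = 1 - eps**2 * delta**2  # one-step temporal factor (1 - eps^2 delta^2)
    return spatial + math.exp(beta) * a + math.exp(beta * q) * a**q


def bias_prefactor(eps, delta, Delta, r, q, beta):
    """Factor multiplying card(J) * exp(-beta * d(J, dK)); None if c >= 1."""
    c = bias_constant(eps, delta, Delta, r, q, beta)
    if c >= 1:
        return None  # the theorem does not apply for this choice of (q, beta)
    return 2 * math.exp(beta * r) / (1 - c) * (1 - eps ** (2 * (q + 1) * Delta))


def smallest_admissible_q(eps, delta, Delta, r, beta, q_max=50):
    """Smallest block length q <= q_max with c < 1, or None if there is none."""
    for q in range(1, q_max + 1):
        if bias_constant(eps, delta, Delta, r, q, beta) < 1:
            return q
    return None
```

For instance, with the hypothetical values `eps=0.999999`, `delta=0.5`, `Delta=3`, `r=1`, `beta=0.05`, short blocks fail the condition $c < 1$ because the temporal terms are too large, while sufficiently long blocks satisfy it; for much smaller `eps` the spatial term dominates and no block length works.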

In  order  to  use  the  comparison  theorem,  we  must  have  a  method  to  construct 
couplings.  Before  we  proceed  to  the  proof  of  Theorem  C.14,  we  begin  by  formulating 
two  elementary  results  that  will  provide  us  with  the  necessary  tools  for  this  purpose. 

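Lemma C.15. If probability measures $\mu, \nu, \gamma$ satisfy $\mu(A) \ge \alpha\gamma(A)$ and $\nu(A) \ge \alpha\gamma(A)$
for every measurable set $A$, then there is a coupling $Q$ of $\mu, \nu$ such that
$\int \mathbf{1}_{x \ne z}\, Q(dx, dz) \le 1 - \alpha$.

Proof. Define $\tilde\mu = (\mu - \alpha\gamma)/(1 - \alpha)$, $\tilde\nu = (\nu - \alpha\gamma)/(1 - \alpha)$, and let

\[
Qf = \alpha \int f(x, x)\, \gamma(dx) + (1 - \alpha) \int f(x, z)\, \tilde\mu(dx)\, \tilde\nu(dz).
\]

The claim follows readily. □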
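
The construction in this proof can be made concrete on a finite state space. The sketch below is an illustration under that assumption only (measures are represented as lists of probabilities, and the function names are hypothetical): it builds the matrix $Q$ from the proof and measures the mass it puts off the diagonal.

```python
def minorization_coupling(mu, nu, gamma, alpha):
    """Coupling Q of mu and nu from the proof of Lemma C.15 (finite state space).

    Requires 0 <= alpha < 1 and mu >= alpha * gamma, nu >= alpha * gamma entrywise.
    Returns Q as a list of rows with Q[x][z] = Q({(x, z)}).
    """
    n = len(mu)
    mu_res = [(m - alpha * g) / (1 - alpha) for m, g in zip(mu, gamma)]
    nu_res = [(v - alpha * g) / (1 - alpha) for v, g in zip(nu, gamma)]
    if min(mu_res + nu_res) < -1e-12:
        raise ValueError("minorization by alpha * gamma fails")
    # Independent coupling of the residual measures, weighted by (1 - alpha) ...
    Q = [[(1 - alpha) * mu_res[x] * nu_res[z] for z in range(n)] for x in range(n)]
    # ... plus the common part alpha * gamma, placed on the diagonal {x = z}.
    for x in range(n):
        Q[x][x] += alpha * gamma[x]
    return Q


def mismatch_mass(Q):
    """Total mass that Q assigns to the set {x != z}."""
    n = len(Q)
    return sum(Q[x][z] for x in range(n) for z in range(n) if x != z)
```

The row sums of `Q` recover `mu`, the column sums recover `nu`, and `mismatch_mass(Q)` is at most `1 - alpha`, since only the residual part can place mass off the diagonal.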

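Lemma C.16. Let $P_1, \ldots, P_q$ be transition kernels on a measurable space $\mathbb{T}$, and
define

\[
\mathbf{P}_x(d\omega_1, \ldots, d\omega_q) = P_1(x, d\omega_1)\, P_2(\omega_1, d\omega_2) \cdots P_q(\omega_{q-1}, d\omega_q).
\]

Suppose that there exist probability measures $\nu_1, \ldots, \nu_q$ on $\mathbb{T}$ such that $P_i(x, A) \ge \alpha\nu_i(A)$
for every measurable set $A$, $x \in \mathbb{T}$, and $i \le q$. Then there exists for every
$x, z \in \mathbb{T}$ a coupling $\mathbf{Q}_{x,z}$ of $\mathbf{P}_x$ and $\mathbf{P}_z$ such that $\int \mathbf{1}_{\omega_i \ne \omega_i'}\, \mathbf{Q}_{x,z}(d\omega, d\omega') \le (1 - \alpha)^i$ for
every $i \le q$.

Proof. Define the transition kernels $\tilde P_i = (P_i - \alpha\nu_i)/(1 - \alpha)$ and

\[
Q_i f(x, z) = \alpha \int f(x', x')\, \nu_i(dx') + (1 - \alpha)\, \mathbf{1}_{x \ne z} \int f(x', z')\, \tilde P_i(x, dx')\, \tilde P_i(z, dz')
+ (1 - \alpha)\, \mathbf{1}_{x = z} \int f(x', x')\, \tilde P_i(x, dx').
\]

Then $Q_i(x, z, \cdot\,)$ is a coupling of $P_i(x, \cdot\,)$ and $P_i(z, \cdot\,)$. Now define

\[
\mathbf{Q}_{x,z}(d\omega_1, d\omega_1', \ldots, d\omega_q, d\omega_q') = Q_1(x, z, d\omega_1, d\omega_1') \cdots Q_q(\omega_{q-1}, \omega_{q-1}', d\omega_q, d\omega_q').
\]

The result follows readily once we note that $\int \mathbf{1}_{x' \ne z'}\, Q_i(x, z, dx', dz') \le (1 - \alpha)\, \mathbf{1}_{x \ne z}$. □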
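
The chained coupling of Lemma C.16 can likewise be checked on a finite state space. The following sketch assumes kernels given as row-stochastic lists of lists (an illustrative representation, with hypothetical function names); it propagates the joint law of $(\omega_i, \omega_i')$ under the kernels $Q_i$ from the proof and records the mismatch probability after each step.

```python
def coupled_step(P, nu, alpha, x, z):
    """Joint law Q_i(x, z, .) from the proof of Lemma C.16, as an n-by-n table."""
    n = len(nu)

    def residual(s):
        # Residual kernel (P(s, .) - alpha * nu) / (1 - alpha).
        return [(P[s][y] - alpha * nu[y]) / (1 - alpha) for y in range(n)]

    rx, rz = residual(x), residual(z)
    if min(rx + rz) < -1e-12:
        raise ValueError("minorization P(x, .) >= alpha * nu fails")
    Q = [[0.0] * n for _ in range(n)]
    for y in range(n):
        Q[y][y] += alpha * nu[y]                  # common part: both chains jump to y
        if x == z:
            Q[y][y] += (1 - alpha) * rx[y]        # already coupled: move together
        else:
            for yp in range(n):
                Q[y][yp] += (1 - alpha) * rx[y] * rz[yp]  # independent residuals
    return Q


def mismatch_probabilities(kernels, nus, alpha, x, z):
    """P(omega_i != omega_i') for i = 1, ..., q under the chained coupling."""
    n = len(nus[0])
    joint = [[0.0] * n for _ in range(n)]
    joint[x][z] = 1.0
    probs = []
    for P, nu in zip(kernels, nus):
        new = [[0.0] * n for _ in range(n)]
        for a in range(n):
            for b in range(n):
                if joint[a][b] == 0.0:
                    continue
                Q = coupled_step(P, nu, alpha, a, b)
                for y in range(n):
                    for yp in range(n):
                        new[y][yp] += joint[a][b] * Q[y][yp]
        joint = new
        probs.append(sum(joint[y][yp] for y in range(n) for yp in range(n) if y != yp))
    return probs
```

The $i$-th entry of the returned list is bounded by $(1-\alpha)^i$, in line with the lemma: once the two copies agree they stay together, and at each step a mismatch survives with probability at most $1 - \alpha$.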
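We can now proceed to the proof of Theorem C.14.

Proof of Theorem C.14. We begin by constructing a measure $\tilde{\mathbf{P}}^\sigma$ that allows us to describe
the block filter $\tilde\pi_n^\sigma$ as a conditional distribution, as explained above. We fix the
initial condition $\sigma \in \mathbb{X}$ throughout the proof (the dependence of various quantities
on $\sigma$ is implicit).

To construct $\tilde{\mathbf{P}}^\sigma$, define for $K \in \mathcal{K}$ and $n \ge 1$ the function

\[
h_n^K(x, z^{\partial K}) := \int \tilde\pi_{n-1}^\sigma(d\omega) \prod_{v \in \partial K} p^v(x^K \omega^{V \setminus K}, z^v).
\]

Evidently $h_n^K$ is a transition density with respect to $\bigotimes_{v \in \partial K} \psi^v$. Let

\[
\tilde p_n(x, z) := \prod_{K \in \mathcal{K}} h_n^K(x, z^{\partial K}) \prod_{v \in K \setminus \partial K} p^v(x, z^v),
\]

and define $\tilde{\mathsf{P}}_n \mu(dx') := \psi(dx') \int \tilde p_n(x, x')\, \mu(dx)$. Then $\tilde{\mathsf{P}}_n \tilde\pi_{n-1}^\sigma = \mathsf{B}\mathsf{P}\tilde\pi_{n-1}^\sigma$ by construction
for every $n \ge 1$, as $\tilde\pi_{n-1}^\sigma$ is a product measure across blocks. Thus we have

\[
\pi_n^\sigma = \mathsf{C}_n \mathsf{P} \cdots \mathsf{C}_1 \mathsf{P}\, \delta_\sigma, \qquad
\tilde\pi_n^\sigma = \mathsf{C}_n \tilde{\mathsf{P}}_n \cdots \mathsf{C}_1 \tilde{\mathsf{P}}_1\, \delta_\sigma.
\]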

In particular, the filter and the block filter satisfy the same recursion with different
transition densities $p$ and $\tilde p_n$. We can therefore interpret the block filter as the filter
corresponding to a time-inhomogeneous Markov chain with transition densities $\tilde p_n$:
that is, if we set

\[
\tilde{\mathbf{P}}^\sigma[(X_1, \ldots, X_n, Y_1, \ldots, Y_n) \in A] :=
\int \mathbf{1}_A(x_1, \ldots, x_n, y_1, \ldots, y_n)\, \tilde p_1(\sigma, x_1)\, g(x_1, y_1)\, \psi(dx_1)\, \varphi(dy_1)
\prod_{k=2}^{n} \tilde p_k(x_{k-1}, x_k)\, g(x_k, y_k)\, \psi(dx_k)\, \varphi(dy_k)
\]

(note that $\mathbf{P}^\sigma$ satisfies the same formula where $\tilde p_k$ is replaced by $p$), we can write

\[
\tilde\pi_n^\sigma = \tilde{\mathbf{P}}^\sigma(X_n \in \cdot \mid Y_1, \ldots, Y_n).
\]

Let us emphasize that the transition densities $\tilde p_k$ and operators $\tilde{\mathsf{P}}_k$ themselves depend
on the initial condition $\sigma$, which is certainly not the case for the regular filter.
However, since $\sigma$ is fixed throughout the proof, this is irrelevant for our computations.
From now on we fix $n \ge 1$ for the remainder of the proof. Let

\[
\rho = \mathbf{P}^\sigma(X_1, \ldots, X_n \in \cdot \mid Y_1, \ldots, Y_n), \qquad
\tilde\rho = \tilde{\mathbf{P}}^\sigma(X_1, \ldots, X_n \in \cdot \mid Y_1, \ldots, Y_n).
\]

Then $\rho$ and $\tilde\rho$ are probability measures on $\mathbb{S} = \mathbb{X}^n$, which is naturally indexed by the
set of sites $I = \{1, \ldots, n\} \times V$ (the observation sequence on which we condition is
arbitrary and can be considered fixed throughout the proof). The proof now proceeds
by applying Theorem 6.4 to $\rho, \tilde\rho$, the main difficulty being the construction of a
coupled update rule.
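Fix $q \ge 1$. We first specify the cover $\mathcal{J} = \{J_l^v : 1 \le l \le \lceil n/q \rceil,\ v \in V\}$ as follows:

\[
J_l^v := \{(l-1)q + 1, \ldots, lq \wedge n\} \times \{v\} \quad \text{for } 1 \le l \le \lceil n/q \rceil,\ v \in V.
\]

We choose the natural local updates $\gamma_x^J(dz^J) = \rho(dz^J \mid x^{I \setminus J})$ and $\tilde\gamma_x^J(dz^J) = \tilde\rho(dz^J \mid x^{I \setminus J})$,
and postpone the construction of the coupled updates $Q_{x,z}^J$ and $\hat Q_x^J$ to be done below.
Now note that the cover $\mathcal{J}$ is in fact a partition of $I$; thus Theorem 6.4 yields

\[
\|\pi_n^\sigma - \tilde\pi_n^\sigma\|_J = \|\rho - \tilde\rho\|_{\{n\} \times J} \le 2 \sum_{i \in \{n\} \times J} \sum_{j \in I} D_{ij}\, b_j
\]

provided that $D = \sum_{k=0}^{\infty} C^k < \infty$ (cf. Corollary 6.8), where

\[
C_{ij} = \sup_{x, z \in \mathbb{S}:\ x^{I \setminus \{j\}} = z^{I \setminus \{j\}}} \int \mathbf{1}_{\omega_i \ne \omega_i'}\, Q_{x,z}^{J(i)}(d\omega, d\omega'), \qquad
b_i = \sup_{x \in \mathbb{S}} \int \mathbf{1}_{\omega_i \ne \omega_i'}\, \hat Q_x^{J(i)}(d\omega, d\omega'),
\]

and where we write $J(i)$ for the unique block $J \in \mathcal{J}$ that contains $i \in I$. To put this
bound to good use, we must introduce coupled updates $Q_{x,z}^J$ and $\hat Q_x^J$ and estimate $C_{ij}$
and $b_j$.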

Let us fix until further notice a block $J = J_l^v \in \mathcal{J}$. We will consider first the case
that $1 < l < \lceil n/q \rceil$; the cases $l = 1, \lceil n/q \rceil$ will follow subsequently using the identical
proof. Let $s = (l-1)q$. Then we can compute explicitly the local update rule

\[
\gamma_x^J(A) = \frac{\int \mathbf{1}_A(x^J)\, p^v(x_s, x_{s+1}^v) \prod_{m=s+1}^{s+q} g^v(x_m^v, Y_m^v) \prod_{w \in N(v)} p^w(x_m, x_{m+1}^w)\, \psi^v(dx_m^v)}
{\int p^v(x_s, x_{s+1}^v) \prod_{m=s+1}^{s+q} g^v(x_m^v, Y_m^v) \prod_{w \in N(v)} p^w(x_m, x_{m+1}^w)\, \psi^v(dx_m^v)}
\]

using Bayes' formula, the definition of $\mathbf{P}^\sigma$ (in the same form as the above definition
of $\tilde{\mathbf{P}}^\sigma$), and that $p^v(x, z^v)$ depends only on $x^{N(v)}$. We now construct couplings $Q_{x,z}^J$
of $\gamma_x^J$ and $\gamma_z^J$ where $x, z$ differ only at the site $j = (k, w) \in I$. We distinguish the
following cases:

1. $k = s$, $w \in N(v) \setminus \{v\}$;

2. $k = s$, $w = v$;

3. $k \in \{s+1, \ldots, s+q\}$, $w \in \bigcup_{u \in N(v)} N(u) \setminus \{v\}$;

4. $k = s + q + 1$, $w \in N(v) \setminus \{v\}$;

5. $k = s + q + 1$, $w = v$.

It is easily verified by inspection that $\gamma_x^J$ does not depend on $x^j$ except in one of the
above cases. Thus when $j$ satisfies none of the above conditions, we can set $C_{ij} = 0$
for $i \in J$.
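Case 1. Note that

\[
\gamma_x^J(A) \ge \varepsilon^2\, \frac{\int \mathbf{1}_A(x^J)\, q^v(x_s^v, x_{s+1}^v) \prod_{m=s+1}^{s+q} g^v(x_m^v, Y_m^v) \prod_{w \in N(v)} p^w(x_m, x_{m+1}^w)\, \psi^v(dx_m^v)}
{\int q^v(x_s^v, x_{s+1}^v) \prod_{m=s+1}^{s+q} g^v(x_m^v, Y_m^v) \prod_{w \in N(v)} p^w(x_m, x_{m+1}^w)\, \psi^v(dx_m^v)},
\]

and the right hand side does not depend on $x_s^w$ for $w \ne v$. Thus whenever $x, z \in \mathbb{S}$
satisfy $x^{I \setminus \{j\}} = z^{I \setminus \{j\}}$ for $j = (s, w)$ with $w \in N(v) \setminus \{v\}$, we can construct a coupling
$Q_{x,z}^J$ using Lemma C.15 such that $C_{ij} \le 1 - \varepsilon^2$ for every $i \in J$.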

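Case 2. Define the transition kernels on $\mathbb{X}^v$

\[
P_{k,x}(\omega, A) = \frac{\int \mathbf{1}_A(x_k^v)\, p^v(\omega x_{k-1}^{V \setminus \{v\}}, x_k^v) \prod_{m=k}^{s+q} g^v(x_m^v, Y_m^v) \prod_{w \in N(v)} p^w(x_m, x_{m+1}^w)\, \psi^v(dx_m^v)}
{\int p^v(\omega x_{k-1}^{V \setminus \{v\}}, x_k^v) \prod_{m=k}^{s+q} g^v(x_m^v, Y_m^v) \prod_{w \in N(v)} p^w(x_m, x_{m+1}^w)\, \psi^v(dx_m^v)}
\]

for $k = s+1, \ldots, s+q$. By construction, $P_{k,x}(x_{k-1}^v, dx_k^v) = \gamma_x^J(dx_k^v \mid x_{s+1}^v, \ldots, x_{k-1}^v)$, so
we are in the setting of Lemma C.16. Moreover, we can estimate

\[
P_{k,x}(\omega, A) \ge \varepsilon^2\delta^2\, \frac{\int \mathbf{1}_A(x_k^v) \prod_{m=k}^{s+q} g^v(x_m^v, Y_m^v) \prod_{w \in N(v)} p^w(x_m, x_{m+1}^w)\, \psi^v(dx_m^v)}
{\int \prod_{m=k}^{s+q} g^v(x_m^v, Y_m^v) \prod_{w \in N(v)} p^w(x_m, x_{m+1}^w)\, \psi^v(dx_m^v)},
\]

where the right hand side does not depend on $\omega$. Thus whenever $x, z \in \mathbb{S}$ satisfy
$x^{I \setminus \{j\}} = z^{I \setminus \{j\}}$ for $j = (s, v)$, we can construct a coupling $Q_{x,z}^J$ using Lemma C.16
such that $C_{ij} \le (1 - \varepsilon^2\delta^2)^{k-s}$ for $i = (k, v)$ with $k = s+1, \ldots, s+q$.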

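Case 3. Fix $k \in \{s+1, \ldots, s+q\}$ and $u \ne v$. Note that

\[
\gamma_x^J(A) \ge \varepsilon^{2(\Delta+1)} \times \frac{\int \mathbf{1}_A(x^J)\, p^v(x_s, x_{s+1}^v) \prod_{m=s+1}^{s+q} g^v(x_m^v, Y_m^v) \prod_{w \in N(v)} \beta_m^w(x_m, x_{m+1}^w)\, \psi^v(dx_m^v)}
{\int p^v(x_s, x_{s+1}^v) \prod_{m=s+1}^{s+q} g^v(x_m^v, Y_m^v) \prod_{w \in N(v)} \beta_m^w(x_m, x_{m+1}^w)\, \psi^v(dx_m^v)},
\]

where we set $\beta_m^w(x_m, x_{m+1}^w) = q^w(x_m^w, x_{m+1}^w)$ if either $m = k$ or $m = k-1$ and $w = u$,
and $\beta_m^w(x_m, x_{m+1}^w) = p^w(x_m, x_{m+1}^w)$ otherwise. The right hand side of this expression
does not depend on $x_k^u$ as the terms $q^w(x_m^w, x_{m+1}^w)$ for $w \ne v$ cancel in the numerator
and denominator. Thus whenever $x, z \in \mathbb{S}$ satisfy $x^{I \setminus \{j\}} = z^{I \setminus \{j\}}$ for $j = (k, u)$, we
can construct a coupling $Q_{x,z}^J$ using Lemma C.15 such that $C_{ij} \le 1 - \varepsilon^{2(\Delta+1)}$ for every
$i \in J$.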

Case 4. Let $u \in N(v) \setminus \{v\}$. Note that

\[
\gamma_x^J(A) \ge \varepsilon^2\, \frac{\int \mathbf{1}_A(x^J)\, p^v(x_s, x_{s+1}^v) \prod_{m=s+1}^{s+q} g^v(x_m^v, Y_m^v) \prod_{w \in N(v)} \beta_m^w(x_m, x_{m+1}^w)\, \psi^v(dx_m^v)}
{\int p^v(x_s, x_{s+1}^v) \prod_{m=s+1}^{s+q} g^v(x_m^v, Y_m^v) \prod_{w \in N(v)} \beta_m^w(x_m, x_{m+1}^w)\, \psi^v(dx_m^v)},
\]

where we set $\beta_m^w(x_m, x_{m+1}^w) = q^w(x_m^w, x_{m+1}^w)$ if $m = s + q$ and $w = u$, and we set
$\beta_m^w(x_m, x_{m+1}^w) = p^w(x_m, x_{m+1}^w)$ otherwise. The right hand side does not depend on
$x_{s+q+1}^u$ as the term $q^u(x_{s+q}^u, x_{s+q+1}^u)$ cancels in the numerator and denominator. Thus
whenever $x, z \in \mathbb{S}$ satisfy $x^{I \setminus \{j\}} = z^{I \setminus \{j\}}$ for $j = (s+q+1, u)$, we can construct a
coupling $Q_{x,z}^J$ using Lemma C.15 such that $C_{ij} \le 1 - \varepsilon^2$ for every $i \in J$.

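Case 5. Define for $k = s+1, \ldots, s+q$ the transition kernels on $\mathbb{X}^v$

\[
P_{k,x}(\omega, A) = \frac{\int \mathbf{1}_A(x_k^v)\, p^v(x_s, x_{s+1}^v) \prod_{m=s+1}^{k} g^v(x_m^v, Y_m^v) \prod_{w \in N(v)} \beta_{m,\omega}^w(x_m, x_{m+1}^w)\, \psi^v(dx_m^v)}
{\int p^v(x_s, x_{s+1}^v) \prod_{m=s+1}^{k} g^v(x_m^v, Y_m^v) \prod_{w \in N(v)} \beta_{m,\omega}^w(x_m, x_{m+1}^w)\, \psi^v(dx_m^v)},
\]

where we set $\beta_{m,\omega}^w(x_m, x_{m+1}^w) = p^v(x_k, \omega)$ if $m = k$ and $w = v$, and $\beta_{m,\omega}^w(x_m, x_{m+1}^w) =
p^w(x_m, x_{m+1}^w)$ otherwise. By construction, $P_{k,x}(x_{k+1}^v, dx_k^v) = \gamma_x^J(dx_k^v \mid x_{k+1}^v, \ldots, x_{s+q}^v)$,
so we are in the setting of Lemma C.16. Moreover, we can estimate

\[
P_{k,x}(\omega, A) \ge \varepsilon^2\delta^2 \times \frac{\int \mathbf{1}_A(x_k^v)\, p^v(x_s, x_{s+1}^v) \prod_{m=s+1}^{k} g^v(x_m^v, Y_m^v) \prod_{w \in N(v)} \beta_m^w(x_m, x_{m+1}^w)\, \psi^v(dx_m^v)}
{\int p^v(x_s, x_{s+1}^v) \prod_{m=s+1}^{k} g^v(x_m^v, Y_m^v) \prod_{w \in N(v)} \beta_m^w(x_m, x_{m+1}^w)\, \psi^v(dx_m^v)},
\]

where $\beta_m^w(x_m, x_{m+1}^w) = 1$ if $m = k$ and $w = v$, and $\beta_m^w(x_m, x_{m+1}^w) = p^w(x_m, x_{m+1}^w)$
otherwise. Note that the right hand side does not depend on $\omega$. Thus whenever
$x, z \in \mathbb{S}$ satisfy $x^{I \setminus \{j\}} = z^{I \setminus \{j\}}$ for $j = (s+q+1, v)$, we can construct a coupling
$Q_{x,z}^J$ using Lemma C.16 such that $C_{ij} \le (1 - \varepsilon^2\delta^2)^{s+q+1-k}$ for $i = (k, v)$ with $k =
s+1, \ldots, s+q$.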

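We have now constructed coupled updates $Q_{x,z}^J$ for every pair $x, z \in \mathbb{S}$ that differ
only at one point. Collecting the above bounds on $C_{ij}$, we can estimate

\[
\begin{aligned}
\sum_{(k',v') \in I} e^{\beta\{|k - k'| + d(v,v')\}}\, C_{(k,v)(k',v')}
&\le 2e^{\beta(q+r)}(1 - \varepsilon^2)\Delta + e^{\beta(q+2r)}\bigl(1 - \varepsilon^{2(\Delta+1)}\bigr)\Delta^2 q \\
&\quad + e^{\beta(k-s)}(1 - \varepsilon^2\delta^2)^{k-s} + e^{\beta(s+q+1-k)}(1 - \varepsilon^2\delta^2)^{s+q+1-k} \\
&\le 3q\Delta^2 e^{\beta(q+2r)}\bigl(1 - \varepsilon^{2(\Delta+1)}\bigr) + e^{\beta}(1 - \varepsilon^2\delta^2) + e^{\beta q}(1 - \varepsilon^2\delta^2)^q =: c
\end{aligned}
\]

whenever $(k, v) \in J$. In the last line, we have used that $\alpha^{x+1} + \alpha^{q-x}$ is a convex
function of $x \in [0, q-1]$, and therefore attains its maximum on the endpoints $x =
0, q-1$.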
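
The endpoint argument can be sanity-checked numerically. The snippet below is only an illustrative check on integer grid points (the function name is hypothetical): it compares the largest value of $\alpha^{x+1} + \alpha^{q-x}$ over $x = 0, \ldots, q-1$ with the endpoint value $\alpha + \alpha^q$ that appears in the definition of $c$, where $\alpha$ plays the role of $e^{\beta}(1 - \varepsilon^2\delta^2)$.

```python
def endpoint_maximum(a, q):
    """Return (max over x = 0..q-1 of a**(x+1) + a**(q-x), endpoint value a + a**q)."""
    values = [a ** (x + 1) + a ** (q - x) for x in range(q)]
    return max(values), a + a**q
```

Both returned numbers agree for any positive `a`, since the two endpoints $x = 0$ and $x = q-1$ give the same value and a convex function cannot exceed its endpoint values in between.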

Up to this point we have considered an arbitrary block $J = J_l^v \in \mathcal{J}$ with $1 < l <
\lceil n/q \rceil$. It is however evident that the identical proof holds for the boundary blocks
$l = 1, \lceil n/q \rceil$, except that for $l = 1$ we only need to consider Cases 3-5 above and
for $l = \lceil n/q \rceil$ we only need to consider Cases 1-3 above. As all the estimates are
otherwise identical, the corresponding bounds on $C_{ij}$ are at most as large as those in
the case $1 < l < \lceil n/q \rceil$. Thus

\[
\|C\|_{\infty,\beta m} := \max_{i \in I} \sum_{j \in I} e^{\beta m(i,j)}\, C_{ij} \le c,
\]

where we define the metric $m(i, j) = |k - k'| + d(v, v')$ for $i = (k, v) \in I$ and $j = (k', v') \in I$.

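Our next order of business is to construct couplings $\hat Q_x^J$ of $\gamma_x^J$ and $\tilde\gamma_x^J$ and to
estimate the coefficients $b_i$. To this end, let us first note that $h_n^K(x, z^{\partial K})$ depends
only on $x^{\partial^2 K}$, where

\[
\partial^2 K := \bigcup_{w \in \partial K} N(w) \cap K
\]

is the subset of vertices in $K$ that can interact with vertices outside $K$ in two time
steps. It is easily seen that $\gamma_x^J = \tilde\gamma_x^J$, and that we can therefore choose $b_i = 0$ for
$i \in J$, unless $J = J_l^v$ with $v \in \partial^2 K$ for some $K \in \mathcal{K}$. In the latter case we obtain by
Bayes' formula

\[
\tilde\gamma_x^J(A) = \frac{\int \mathbf{1}_A(x^J) \prod_{m=s}^{s+q} g^v(x_m^v, Y_m^v)\, h_{m+1}^K(x_m, x_{m+1}^{\partial K}) \prod_{w \in N(v) \cap K \setminus \partial K} p^w(x_m, x_{m+1}^w)\, \psi^J(dx^J)}
{\int \prod_{m=s}^{s+q} g^v(x_m^v, Y_m^v)\, h_{m+1}^K(x_m, x_{m+1}^{\partial K}) \prod_{w \in N(v) \cap K \setminus \partial K} p^w(x_m, x_{m+1}^w)\, \psi^J(dx^J)}
\]

for $1 < l < \lceil n/q \rceil$, where $s = (l-1)q$ and $\psi^J(dx^J) = \bigotimes_{(k,v) \in J} \psi^v(dx_k^v)$. Note that

\[
\prod_{w \in N(v) \setminus (K \setminus \partial K)} p^w(x, z^w) \ge \varepsilon^{\Delta} \prod_{w \in N(v) \setminus (K \setminus \partial K)} q^w(x^w, z^w),
\]

while

\[
h_n^K(x, z^{\partial K}) \ge \varepsilon^{\Delta} \prod_{w \in N(v) \cap \partial K} q^w(x^w, z^w) \int \tilde\pi_{n-1}^\sigma(d\omega) \prod_{w \in \partial K \setminus N(v)} p^w(x^K \omega^{V \setminus K}, z^w).
\]

We can therefore estimate $\gamma_x^J(A) \ge \varepsilon^{2(q+1)\Delta}\, \Gamma(A)$ and $\tilde\gamma_x^J(A) \ge \varepsilon^{2(q+1)\Delta}\, \Gamma(A)$ with

\[
\Gamma(A) = \frac{\int \mathbf{1}_A(x^J) \prod_{m=s}^{s+q} g^v(x_m^v, Y_m^v)\, \beta(x_m^v, x_{m+1}^v) \prod_{w \in N(v) \cap K \setminus \partial K} p^w(x_m, x_{m+1}^w)\, \psi^J(dx^J)}
{\int \prod_{m=s}^{s+q} g^v(x_m^v, Y_m^v)\, \beta(x_m^v, x_{m+1}^v) \prod_{w \in N(v) \cap K \setminus \partial K} p^w(x_m, x_{m+1}^w)\, \psi^J(dx^J)},
\]

where $\beta(x, z) = q^v(x, z)$ if $v \in \partial K$ and $\beta(x, z) = 1$ if $v \in \partial^2 K \setminus \partial K$. Thus we can
construct a coupling $\hat Q_x^J$ using Lemma C.15 such that $b_i \le 1 - \varepsilon^{2(q+1)\Delta}$ for all $i \in J$ in
the case $1 < l < \lceil n/q \rceil$. The same conclusion follows for $l = 1, \lceil n/q \rceil$ by the identical
proof.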

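We are now ready to put everything together. As $\|\cdot\|_{\infty,\beta m}$ is a matrix norm, we
have

\[
\|D\|_{\infty,\beta m} \le \sum_{k=0}^{\infty} \|C\|_{\infty,\beta m}^k \le \frac{1}{1-c} < \infty.
\]

Thus $D < \infty$, so we can apply the comparison theorem. Moreover,

\[
\sup_{i \in J} \sum_{j \in J'} D_{ij} = \sup_{i \in J} \sum_{j \in J'} e^{-\beta m(i,j)}\, e^{\beta m(i,j)} D_{ij} \le e^{-\beta m(J,J')}\, \|D\|_{\infty,\beta m}.
\]

Thus we obtain

\[
\|\pi_n^\sigma - \tilde\pi_n^\sigma\|_J \le 2\bigl(1 - \varepsilon^{2(q+1)\Delta}\bigr) \sum_{i \in \{n\} \times J}\ \sum_{j \in \{1,\ldots,n\} \times \partial^2 K} D_{ij}
\le \frac{2}{1-c}\,\bigl(1 - \varepsilon^{2(q+1)\Delta}\bigr)\, \mathrm{card}\,J \; e^{-\beta d(J,\partial^2 K)}.
\]

But clearly $d(J, \partial^2 K) \ge d(J, \partial K) - r$, and the proof is complete. □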

Remark C.17. In the proof of Theorem C.14 (and similarly for Theorem C.20 below),
we apply the comparison theorem with a nonoverlapping cover $\{(l-1)q+1, \ldots, lq \wedge n\}$,
$l \le \lceil n/q \rceil$. Working instead with overlapping blocks $\{s+1, \ldots, s+q\}$, $s \le n - q$ would
give somewhat better estimates at the expense of even more tedious computations.

C.6.2  Bounding  the  variance 

We now turn to the problem of bounding the variance term $|||\tilde\pi_n^\sigma - \hat\pi_n^\sigma|||_J$. We will
follow the basic approach taken in Section 4.5.4 and Section A.6, where a detailed
discussion of the requisite ideas can be found. In this section we develop the necessary
changes to the proof in Section A.6.

At the heart of the proof of the variance bound lies a stability result for the block
filter, Proposition A.13. This result must be modified in the present setting to account
for the different assumptions on the spatial and temporal correlations. This will be
done next, using the generalized comparison Theorem 6.4 much as in the proof of
Theorem C.14.

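Proposition C.18. Suppose there exist $0 < \varepsilon, \delta < 1$ such that

\[
\varepsilon\, q^v(x^v, z^v) \le p^v(x, z^v) \le \varepsilon^{-1} q^v(x^v, z^v), \qquad
\delta \le q^v(x^v, z^v) \le \delta^{-1}
\]

for every $v \in V$ and $x, z \in \mathbb{X}$, where $q^v : \mathbb{X}^v \times \mathbb{X}^v \to \mathbb{R}_+$ is a transition density with
respect to $\psi^v$. Suppose also that we can choose $q \in \mathbb{N}$ and $\beta > 0$ such that

\[
c := 3q\Delta^2 e^{\beta q}\bigl(1 - \varepsilon^{2(\Delta+1)}\bigr) + e^{\beta}(1 - \varepsilon^2\delta^2) + e^{\beta q}(1 - \varepsilon^2\delta^2)^q < 1.
\]

Then we have

\[
\|\tilde{\mathsf{F}}_n \cdots \tilde{\mathsf{F}}_{s+1}\delta_\sigma - \tilde{\mathsf{F}}_n \cdots \tilde{\mathsf{F}}_{s+1}\delta_{\tilde\sigma}\|_J \le \frac{2}{1-c}\, \mathrm{card}\,J \; e^{-\beta(n-s)}
\]

for every $s < n$, $\sigma, \tilde\sigma \in \mathbb{X}$, $K \in \mathcal{K}$ and $J \subseteq K$.

Proof. We fix throughout the proof $n > 0$, $K \in \mathcal{K}$, and $J \subseteq K$. We will also
assume for notational simplicity that $s = 0$. As the $\tilde{\mathsf{F}}_k$ differ for different $k$ only by their
dependence on different observations $Y_k$, and as the conclusion of the Proposition is
independent of the observations, the conclusion for $s = 0$ extends trivially to any
$s < n$.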
As in Theorem C.14, the idea behind the proof is to introduce a Markov random
field $\rho$ of which the block filter is a marginal, followed by an application of the generalized
comparison theorem. Unfortunately, the construction in the proof of Theorem
C.14 is not appropriate in the present setting, as there all the local interactions depend
on the initial condition $\sigma$. That was irrelevant in Theorem C.14 where the initial
condition was fixed, but is fatal in the present setting where we aim to understand
a perturbation to the initial condition. Instead, we will use a more elaborate construction
of $\rho$ introduced in Section A.6, called the computation tree. We begin by
recalling this construction.

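Define for $K' \in \mathcal{K}$ the block neighborhood $N(K') := \{K'' \in \mathcal{K} : d(K', K'') \le r\}$
(we recall that $\mathrm{card}\,N(K') \le \Delta_{\mathcal{K}}$). We can evidently write

\[
\mathsf{B}^{K'} \tilde{\mathsf{F}}_s \bigotimes_{K'' \in \mathcal{K}} \mu^{K''} = \mathsf{C}_s^{K'} \mathsf{P}^{K'} \bigotimes_{K'' \in N(K')} \mu^{K''},
\]

where we define for any probability $\eta$ on $\mathbb{X}^{K'}$

\[
(\mathsf{C}_s^{K'} \eta)(A) := \frac{\int \mathbf{1}_A(x^{K'}) \prod_{v \in K'} g^v(x^v, Y_s^v)\, \eta(dx^{K'})}{\int \prod_{v \in K'} g^v(x^v, Y_s^v)\, \eta(dx^{K'})}
\]

and for any probability $\eta$ on $\mathbb{X}^{\bigcup_{K'' \in N(K')} K''}$

\[
(\mathsf{P}^{K'} \eta)(A) := \int \mathbf{1}_A(x^{K'}) \prod_{v \in K'} p^v(z, x^v)\, \psi^v(dx^v)\, \eta(dz).
\]

Iterating this identity yields

\[
\mathsf{B}^K \tilde{\mathsf{F}}_n \cdots \tilde{\mathsf{F}}_1 \delta_\sigma =
\mathsf{C}_n^K \mathsf{P}^K \bigotimes_{K_{n-1} \in N(K)} \Biggl[ \mathsf{C}_{n-1}^{K_{n-1}} \mathsf{P}^{K_{n-1}} \cdots \mathsf{C}_2^{K_2} \mathsf{P}^{K_2} \bigotimes_{K_1 \in N(K_2)} \Biggl[ \mathsf{C}_1^{K_1} \mathsf{P}^{K_1} \bigotimes_{K_0 \in N(K_1)} \delta_{\sigma^{K_0}} \Biggr] \cdots \Biggr].
\]

The nested products can be naturally viewed as defining a tree.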

To formalize this idea, define the tree index set (we will write $K_n := K$ for
simplicity)

\[
T := \{[K_u \cdots K_{n-1}] : 0 \le u < n,\ K_s \in N(K_{s+1}) \text{ for } u \le s < n\} \cup \{[\varnothing]\}.
\]

The root of the tree $[\varnothing]$ represents the block $K$ at time $n$, while $[K_u \cdots K_{n-1}]$ represents
a duplicate of the block $K_u$ at time $u$ that affects the root along the branch
$K_u \to K_{u+1} \to \cdots \to K_{n-1} \to K$. The set of sites corresponding to the computation
tree is

\[
I = \{[K_u \cdots K_{n-1}]v : [K_u \cdots K_{n-1}] \in T,\ v \in K_u\} \cup \{[\varnothing]v : v \in K\},
\]

and the corresponding configuration space is $\mathbb{S} = \prod_{i \in I} \mathbb{X}^i$ with $\mathbb{X}^{[t]v} := \mathbb{X}^v$. The
following tree notation will be used throughout the proof. Define for vertices of the
tree $T$ the depth $d([K_u \cdots K_{n-1}]) := u$ and $d([\varnothing]) := n$. For every site $i = [t]v \in I$, we
define the associated vertex $v(i) := v$ and depth $d(i) := d([t])$. Define also the sets
$I_+ := \{i \in I : d(i) > 0\}$ and $T_0 := \{[t] \in T : d([t]) = 0\}$ of non-leaf sites and leaf
vertices, respectively. Define

\[
c([K_u \cdots K_{n-1}]v) := \{[K_{u-1} \cdots K_{n-1}]v' : K_{u-1} \in N(K_u),\ v' \in N(v)\},
\]

and similarly for $c([\varnothing]v)$: that is, $c(i)$ is the set of children of the site $i \in I$ in the
computation tree. Finally, we will frequently identify a tree vertex $[K_u \cdots K_{n-1}] \in T$
with the corresponding set of sites $\{[K_u \cdots K_{n-1}]v : v \in K_u\}$, and analogously for
$[\varnothing]$.
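
To make the tree index set $T$ concrete, the following sketch enumerates it for a small example. It is an illustration only: blocks are labelled by integers, the block neighborhood relation $N(\cdot)$ is supplied as a dictionary, and the function names are hypothetical. A tree vertex $[K_u \cdots K_{n-1}]$ is represented by the tuple `(K_u, ..., K_{n-1})` and the root $[\varnothing]$ by the empty tuple.

```python
def computation_tree(block_neighbors, root, n):
    """Enumerate the tree vertices [K_u ... K_{n-1}] as tuples; the root is ()."""
    vertices = [()]
    frontier = [()]
    for _ in range(n):                      # adds depths n-1, n-2, ..., 0 in turn
        grown = []
        for t in frontier:
            head = t[0] if t else root      # K_{u+1}, with the convention K_n := K
            for child_block in block_neighbors[head]:
                grown.append((child_block,) + t)
        vertices.extend(grown)
        frontier = grown
    return vertices


def depth(t, n):
    """Depth d([K_u ... K_{n-1}]) = u, and d(root) = n."""
    return n - len(t)
```

For example, if every block has exactly three blocks in its neighborhood (itself and two others), the number of tree vertices at depth $u$ is $3^{n-u}$, so the tree grows exponentially towards the leaves.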

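Having introduced the tree structure, we now define probability measures $\rho, \tilde\rho$ on
$\mathbb{S}$ by

\[
\rho(A) = \frac{\int \mathbf{1}_A(x) \prod_{i \in I_+} p^{v(i)}(x^{c(i)}, x^i)\, g^{v(i)}(x^i, Y^i)\, \psi^{v(i)}(dx^i) \prod_{[t] \in T_0} \delta_{\sigma^{[t]}}(dx^{[t]})}
{\int \prod_{i \in I_+} p^{v(i)}(x^{c(i)}, x^i)\, g^{v(i)}(x^i, Y^i)\, \psi^{v(i)}(dx^i) \prod_{[t] \in T_0} \delta_{\sigma^{[t]}}(dx^{[t]})},
\]

\[
\tilde\rho(A) = \frac{\int \mathbf{1}_A(x) \prod_{i \in I_+} p^{v(i)}(x^{c(i)}, x^i)\, g^{v(i)}(x^i, Y^i)\, \psi^{v(i)}(dx^i) \prod_{[t] \in T_0} \delta_{\tilde\sigma^{[t]}}(dx^{[t]})}
{\int \prod_{i \in I_+} p^{v(i)}(x^{c(i)}, x^i)\, g^{v(i)}(x^i, Y^i)\, \psi^{v(i)}(dx^i) \prod_{[t] \in T_0} \delta_{\tilde\sigma^{[t]}}(dx^{[t]})},
\]

where we write $\sigma^{[K_0 \cdots K_{n-1}]} := \sigma^{K_0}$ and $Y^i := Y_{d(i)}^{v(i)}$ for simplicity. Then, by construction,
the measure $\mathsf{B}^K \tilde{\mathsf{F}}_n \cdots \tilde{\mathsf{F}}_1 \delta_\sigma$ coincides with the marginal of $\rho$ on the root of the
computation tree, while $\mathsf{B}^K \tilde{\mathsf{F}}_n \cdots \tilde{\mathsf{F}}_1 \delta_{\tilde\sigma}$ coincides with the marginal of $\tilde\rho$ on the root
of the tree: this is easily seen by expanding the above nested product identity. In
particular, we obtain

\[
\|\tilde{\mathsf{F}}_n \cdots \tilde{\mathsf{F}}_1 \delta_\sigma - \tilde{\mathsf{F}}_n \cdots \tilde{\mathsf{F}}_1 \delta_{\tilde\sigma}\|_J = \|\rho - \tilde\rho\|_{[\varnothing]J},
\]

and we aim to apply the comparison theorem to estimate this quantity.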

The construction of the computation tree that we have just given is identical to the
construction in Section A.6. We deviate from the proof of Appendix A from this point
onward,  since  we  must  use  Theorem  6.4  instead  of  the  classical  Dobrushin  comparison 
theorem  to  account  for  the  distinction  between  temporal  and  spatial  correlations  in 
the  present  setting. 

Fix $q \ge 1$. In analogy with the proof of Theorem C.14, we consider a cover $\mathcal{J}$
consisting of blocks of sites $i \in I$ such that $(l-1)q < d(i) \le lq \wedge n$ and $v(i) = v$. In
the present setting, however, the same vertex $v$ is duplicated many times in the tree,
so that we end up with many disconnected blocks of different lengths. To keep track
of these blocks, define

\[
I_0 := \{i \in I : d(i) = 0\}, \qquad I_l := \{i \in I : d(i) = (l-1)q + 1\}
\]

for $1 \le l \le \lceil n/q \rceil$, and let

\[
\ell([K_u, \ldots, K_{n-1}]v) := \max\{s \ge u : K_u = K_{u+1} = \cdots = K_s\}.
\]

We now define the cover $\mathcal{J}$ as

\[
\mathcal{J} = \{J_l^i : 0 \le l \le \lceil n/q \rceil,\ i \in I_l\},
\]

where

\[
J_0^i := \{i\}, \qquad J_l^i := \{[K_u \cdots K_{n-1}]v : (l-1)q + 1 \le u \le lq \wedge \ell(i)\}
\]

for $i = [K_{(l-1)q+1} \cdots K_{n-1}]v \in I_l$ and $1 \le l \le \lceil n/q \rceil$. It is easily seen that $\mathcal{J}$ is in fact
a partition of the computation tree $I$ into linear segments.

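Having defined the cover $\mathcal{J}$, we must now consider a suitable coupled update
rule. We will choose the natural local updates $\gamma_x^J(dz^J) = \rho(dz^J \mid x^{I \setminus J})$ and $\tilde\gamma_x^J(dz^J) =
\tilde\rho(dz^J \mid x^{I \setminus J})$, with the coupled updates $Q_{x,z}^J$ and $\hat Q_x^J$ to be constructed below. Then
Theorem 6.4 yields

\[
\|\tilde{\mathsf{F}}_n \cdots \tilde{\mathsf{F}}_1 \delta_\sigma - \tilde{\mathsf{F}}_n \cdots \tilde{\mathsf{F}}_1 \delta_{\tilde\sigma}\|_J = \|\rho - \tilde\rho\|_{[\varnothing]J} \le 2 \sum_{i \in [\varnothing]J} \sum_{j \in I} D_{ij}\, b_j
\]

provided that $D = \sum_{k=0}^{\infty} C^k < \infty$ (cf. Corollary 6.8), where

\[
C_{ij} = \sup_{x, z \in \mathbb{S}:\ x^{I \setminus \{j\}} = z^{I \setminus \{j\}}} \int \mathbf{1}_{\omega_i \ne \omega_i'}\, Q_{x,z}^{J(i)}(d\omega, d\omega'), \qquad
b_i = \sup_{x \in \mathbb{S}} \int \mathbf{1}_{\omega_i \ne \omega_i'}\, \hat Q_x^{J(i)}(d\omega, d\omega'),
\]

and where we write $J(i)$ for the unique block $J \in \mathcal{J}$ that contains $i \in I$. To put this
bound to good use, we must introduce coupled updates $Q_{x,z}^J$ and $\hat Q_x^J$ and estimate $C_{ij}$
and $b_j$.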

Let us fix until further notice a block $J = J_l^i \in \mathcal{J}$ with $i = [K_{(l-1)q+1} \cdots K_{n-1}]v \in I_l$
and $1 \le l \le \lceil n/q \rceil$. From the definition of $\rho$, we can compute explicitly

\[
\gamma_x^J(A) = \frac{\int \mathbf{1}_A(x^J)\, p^v(x^{c(i)}, x^i) \prod_{a \in I_+ :\, J \cap c(a) \ne \varnothing} p^{v(a)}(x^{c(a)}, x^a) \prod_{b \in J} g^v(x^b, Y^b)\, \psi^v(dx^b)}
{\int p^v(x^{c(i)}, x^i) \prod_{a \in I_+ :\, J \cap c(a) \ne \varnothing} p^{v(a)}(x^{c(a)}, x^a) \prod_{b \in J} g^v(x^b, Y^b)\, \psi^v(dx^b)}
\]

using the Bayes formula. We now proceed to construct couplings $Q_{x,z}^J$ of $\gamma_x^J$ and $\gamma_z^J$
for $x, z \in \mathbb{S}$ that differ only at the site $j \in I$. We distinguish the following cases:

1. $d(j) = (l-1)q$ and $v(j) \ne v$;

2. $d(j) = (l-1)q$ and $v(j) = v$;

3. $(l-1)q + 1 \le d(j) \le lq \wedge \ell(i)$ and $v(j) \ne v$;

4. $d(j) = lq \wedge \ell(i) + 1$ and $v(j) \ne v$;

5. $d(j) = lq \wedge \ell(i) + 1$ and $v(j) = v$.

It is easily seen that $\gamma_x^J$ does not depend on $x^j$ except in one of the above cases. Thus
when $j$ satisfies none of the above conditions, we can set $C_{aj} = 0$ for $a \in J$.

Case 1. In this case, we must have $j \in c(i)$ with $v(j) \ne v$. Note that

\[
\gamma_x^J(A) \ge \varepsilon^2\, \frac{\int \mathbf{1}_A(x^J)\, q^v(x^{i_-}, x^i) \prod_{a \in I_+ :\, J \cap c(a) \ne \varnothing} p^{v(a)}(x^{c(a)}, x^a) \prod_{b \in J} g^v(x^b, Y^b)\, \psi^v(dx^b)}
{\int q^v(x^{i_-}, x^i) \prod_{a \in I_+ :\, J \cap c(a) \ne \varnothing} p^{v(a)}(x^{c(a)}, x^a) \prod_{b \in J} g^v(x^b, Y^b)\, \psi^v(dx^b)},
\]

where we define $i_- \in c(i)$ to be the (unique) child of $i$ such that $v(i_-) = v(i)$. As the
right hand side does not depend on $x^j$, we can construct a coupling $Q_{x,z}^J$ using Lemma
C.15 such that $C_{aj} \le 1 - \varepsilon^2$ for every $a \in J$ and $x, z \in \mathbb{S}$ such that $x^{I \setminus \{j\}} = z^{I \setminus \{j\}}$.

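Case 2. In this case we have $j = i_-$. Let us write $J = \{i_1, \ldots, i_u\}$ where
$u = \mathrm{card}\,J$ and $d(i_k) = (l-1)q + k$ for $k = 1, \ldots, u$. Thus $i_1 = i$, and we define
$i_0 = i_-$. Let us also write $J_k = \{i_k, \ldots, i_u\}$. Then we can define the transition kernels
on $\mathbb{X}^v$

\[
P_{k,x}(\omega, A) = \frac{\int \mathbf{1}_A(x^{i_k})\, p^v(\omega x^{c(i_k) \setminus i_{k-1}}, x^{i_k}) \prod_{a \in I_+ :\, J_k \cap c(a) \ne \varnothing} p^{v(a)}(x^{c(a)}, x^a) \prod_{b \in J_k} g^v(x^b, Y^b)\, \psi^v(dx^b)}
{\int p^v(\omega x^{c(i_k) \setminus i_{k-1}}, x^{i_k}) \prod_{a \in I_+ :\, J_k \cap c(a) \ne \varnothing} p^{v(a)}(x^{c(a)}, x^a) \prod_{b \in J_k} g^v(x^b, Y^b)\, \psi^v(dx^b)}
\]

for $k = 1, \ldots, u$. By construction, $P_{k,x}(x^{i_{k-1}}, dx^{i_k}) = \gamma_x^J(dx^{i_k} \mid x^{i_1}, \ldots, x^{i_{k-1}})$, so we are
in the setting of Lemma C.16. Moreover, we can estimate

\[
P_{k,x}(\omega, A) \ge \varepsilon^2\delta^2\, \frac{\int \mathbf{1}_A(x^{i_k}) \prod_{a \in I_+ :\, J_k \cap c(a) \ne \varnothing} p^{v(a)}(x^{c(a)}, x^a) \prod_{b \in J_k} g^v(x^b, Y^b)\, \psi^v(dx^b)}
{\int \prod_{a \in I_+ :\, J_k \cap c(a) \ne \varnothing} p^{v(a)}(x^{c(a)}, x^a) \prod_{b \in J_k} g^v(x^b, Y^b)\, \psi^v(dx^b)}.
\]

Thus whenever $x, z \in \mathbb{S}$ satisfy $x^{I \setminus \{j\}} = z^{I \setminus \{j\}}$, we can construct a coupling $Q_{x,z}^J$ using
Lemma C.16 such that $C_{i_k j} \le (1 - \varepsilon^2\delta^2)^k$ for every $k = 1, \ldots, u$.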

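Case 3. In this case we have $j \in \bigcup_{a \in I_+ :\, J \cap c(a) \ne \varnothing} c(a)$ or $J \cap c(j) \ne \varnothing$, with
$v(j) \ne v$. Let us note for future reference that there are at most $q\Delta^2$ such sites $j$.
Now note that

\[
\gamma_x^J(A) \ge \varepsilon^{2(\Delta+1)} \times \frac{\int \mathbf{1}_A(x^J)\, p^v(x^{c(i)}, x^i) \prod_{a \in I_+ :\, J \cap c(a) \ne \varnothing} \beta^a(x^{c(a)}, x^a) \prod_{b \in J} g^v(x^b, Y^b)\, \psi^v(dx^b)}
{\int p^v(x^{c(i)}, x^i) \prod_{a \in I_+ :\, J \cap c(a) \ne \varnothing} \beta^a(x^{c(a)}, x^a) \prod_{b \in J} g^v(x^b, Y^b)\, \psi^v(dx^b)},
\]

where we have defined $\beta^a(x^{c(a)}, x^a) = q^{v(a)}(x^{a_-}, x^a)$ whenever $j = a$ or $j \in c(a)$, and
$\beta^a(x^{c(a)}, x^a) = p^{v(a)}(x^{c(a)}, x^a)$ otherwise. The right hand side of this expression does
not depend on $x^j$ as the terms $q^{v(a)}(x^{a_-}, x^a)$ for $v(a) \ne v$ cancel in the numerator
and denominator. Thus whenever $x, z \in \mathbb{S}$ satisfy $x^{I \setminus \{j\}} = z^{I \setminus \{j\}}$, we can construct a
coupling $Q_{x,z}^J$ using Lemma C.15 such that $C_{aj} \le 1 - \varepsilon^{2(\Delta+1)}$ for every $a \in J$.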

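Case 4. In this case $J \cap c(j) \ne \varnothing$ with $v(j) \ne v$. Note that

\[
\gamma_x^J(A) \ge \varepsilon^2\, \frac{\int \mathbf{1}_A(x^J)\, p^v(x^{c(i)}, x^i) \prod_{a \in I_+ :\, J \cap c(a) \ne \varnothing} \beta^a(x^{c(a)}, x^a) \prod_{b \in J} g^v(x^b, Y^b)\, \psi^v(dx^b)}
{\int p^v(x^{c(i)}, x^i) \prod_{a \in I_+ :\, J \cap c(a) \ne \varnothing} \beta^a(x^{c(a)}, x^a) \prod_{b \in J} g^v(x^b, Y^b)\, \psi^v(dx^b)},
\]

where $\beta^a(x^{c(a)}, x^a) = q^{v(a)}(x^{a_-}, x^a)$ when $j = a$, and $\beta^a(x^{c(a)}, x^a) = p^{v(a)}(x^{c(a)}, x^a)$
otherwise. The right hand side does not depend on $x^j$ as the term $q^{v(j)}(x^{j_-}, x^j)$ cancels
in the numerator and denominator. Thus whenever $x, z \in \mathbb{S}$ satisfy $x^{I \setminus \{j\}} = z^{I \setminus \{j\}}$,
we can construct a coupling $Q_{x,z}^J$ using Lemma C.15 such that $C_{aj} \le 1 - \varepsilon^2$ for every
$a \in J$.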


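Case 5. In this case we have $j_- \in J$. Note that the existence of such $j$ necessarily
implies that $\ell(i) > lq$ by the definition of $J$. We can therefore write $J = \{i_1, \ldots, i_q\}$
where $d(i_k) = lq - k + 1$ for $k = 1, \ldots, q$, and we define $i_0 = j$. Let us also define the
sets $J_k = \{i_k, \ldots, i_q\}$. Then we can define the transition kernels on $\mathbb{X}^v$

\[
P_{k,x}(\omega, A) = \frac{\int \mathbf{1}_A(x^{i_k})\, p^v(x^{c(i_q)}, x^{i_q}) \prod_{a \in I_+ :\, J_k \cap c(a) \ne \varnothing} \beta_\omega^a(x^{c(a)}, x^a) \prod_{b \in J_k} g^v(x^b, Y^b)\, \psi^v(dx^b)}
{\int p^v(x^{c(i_q)}, x^{i_q}) \prod_{a \in I_+ :\, J_k \cap c(a) \ne \varnothing} \beta_\omega^a(x^{c(a)}, x^a) \prod_{b \in J_k} g^v(x^b, Y^b)\, \psi^v(dx^b)}
\]

for $k = 1, \ldots, q$, where $\beta_\omega^a(x^{c(a)}, x^a) = p^v(x^{c(a)}, \omega)$ if $a = i_{k-1}$ and $\beta_\omega^a(x^{c(a)}, x^a) =
p^{v(a)}(x^{c(a)}, x^a)$ otherwise. By construction $P_{k,x}(x^{i_{k-1}}, dx^{i_k}) = \gamma_x^J(dx^{i_k} \mid x^{i_1}, \ldots, x^{i_{k-1}})$,
so we are in the setting of Lemma C.16. Moreover, we can estimate

\[
P_{k,x}(\omega, A) \ge \varepsilon^2\delta^2 \times \frac{\int \mathbf{1}_A(x^{i_k})\, p^v(x^{c(i_q)}, x^{i_q}) \prod_{a \in I_+ :\, J_k \cap c(a) \ne \varnothing} \beta^a(x^{c(a)}, x^a) \prod_{b \in J_k} g^v(x^b, Y^b)\, \psi^v(dx^b)}
{\int p^v(x^{c(i_q)}, x^{i_q}) \prod_{a \in I_+ :\, J_k \cap c(a) \ne \varnothing} \beta^a(x^{c(a)}, x^a) \prod_{b \in J_k} g^v(x^b, Y^b)\, \psi^v(dx^b)},
\]

where $\beta^a(x^{c(a)}, x^a) = 1$ if $a = i_{k-1}$ and $\beta^a(x^{c(a)}, x^a) = p^{v(a)}(x^{c(a)}, x^a)$ otherwise. Thus
whenever $x, z \in \mathbb{S}$ satisfy $x^{I \setminus \{j\}} = z^{I \setminus \{j\}}$, we can construct a coupling $Q_{x,z}^J$ using
Lemma C.16 such that $C_{i_k j} \le (1 - \varepsilon^2\delta^2)^k$ for every $k = 1, \ldots, q$.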

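We have now constructed coupled updates $Q_{x,z}^J$ for every pair $x, z \in \mathbb{S}$ that differ
only at one point. Collecting the above bounds on the matrix $C$, we can estimate

\[
\sum_{j \in I} e^{\beta|d(a) - d(j)|}\, C_{aj} \le 3q\Delta^2 e^{\beta q}\bigl(1 - \varepsilon^{2(\Delta+1)}\bigr) + e^{\beta}(1 - \varepsilon^2\delta^2) + e^{\beta q}(1 - \varepsilon^2\delta^2)^q =: c
\]

whenever $a \in J$, where we have used the convexity of the function $\alpha^{x+1} + \alpha^{q-x}$.

Up to this point we have considered an arbitrary block $J = J_l^i \in \mathcal{J}$ with $1 \le l \le
\lceil n/q \rceil$. However, in the remaining case $l = 0$ it is easily seen that $\gamma_x^J = \delta_{\sigma^J}$ does not
depend on $x$, so we can evidently set $C_{aj} = 0$ for $a \in J$. Thus we have shown that

\[
\|C\|_{\infty,\beta m} := \max_{i \in I} \sum_{j \in I} e^{\beta m(i,j)}\, C_{ij} \le c,
\]

where we define the pseudometric $m(i, j) = |d(i) - d(j)|$. On the other hand, in the
present setting it is evident that $\gamma_x^J = \tilde\gamma_x^J$ whenever $J = J_l^i \in \mathcal{J}$ with $1 \le l \le \lceil n/q \rceil$.
We can therefore choose couplings $\hat Q_x^J$ such that $b_i \le \mathbf{1}_{d(i)=0}$ for all $i \in I$. Substituting
into the comparison theorem and arguing as in the proof of Theorem C.14 yields the
estimate

\[
\|\tilde{\mathsf{F}}_n \cdots \tilde{\mathsf{F}}_1 \delta_\sigma - \tilde{\mathsf{F}}_n \cdots \tilde{\mathsf{F}}_1 \delta_{\tilde\sigma}\|_J \le \frac{2}{1-c}\, \mathrm{card}\,J \; e^{-\beta n}.
\]

Thus the proof is complete. □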


Proposition  C.18  provides  control  of  the  block  filter  as  a  function  of  time  but  not 
as  a  function  of  the  initial  conditions.  The  dependence  on  the  initial  conditions  can 
however be incorporated a posteriori as in the proof of Proposition A.15. This yields
the  following  result,  which  forms  the  basis  for  the  proof  of  Theorem  C.20  below. 
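Corollary C.19 (Block filter stability). Suppose there exist $0 < \varepsilon, \delta < 1$ such that

\[
\varepsilon\, q^v(x^v, z^v) \le p^v(x, z^v) \le \varepsilon^{-1} q^v(x^v, z^v), \qquad
\delta \le q^v(x^v, z^v) \le \delta^{-1}
\]

for every $v \in V$ and $x, z \in \mathbb{X}$, where $q^v : \mathbb{X}^v \times \mathbb{X}^v \to \mathbb{R}_+$ is a transition density with
respect to $\psi^v$. Suppose also that we can choose $q \in \mathbb{N}$ and $\beta > 0$ such that

\[
c := 3q\Delta^2 e^{\beta q}\bigl(1 - \varepsilon^{2(\Delta+1)}\bigr) + e^{\beta}(1 - \varepsilon^2\delta^2) + e^{\beta q}(1 - \varepsilon^2\delta^2)^q < 1.
\]

Let $\mu$ and $\nu$ be (possibly random) probability measures on $\mathbb{X}$ of the form

\[
\mu = \bigotimes_{K \in \mathcal{K}} \mu^K, \qquad \nu = \bigotimes_{K \in \mathcal{K}} \nu^K.
\]

Then we have

\[
\|\tilde{\mathsf{F}}_n \cdots \tilde{\mathsf{F}}_{s+1}\mu - \tilde{\mathsf{F}}_n \cdots \tilde{\mathsf{F}}_{s+1}\nu\|_J \le \frac{2}{1-c}\, \mathrm{card}\,J \; e^{-\beta(n-s)},
\]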


as  well  as 

E[\\Fn---Fs+1ii-Fn---Fs+1u\\2\1/2 

<  j-2—  .  card  J(e~^Ax)n~s  max  E[||/iA  -  uK\\2}1/2, 

for every $s < n$, $K \in \mathcal{K}$ and $J \subseteq K$.

Proof. The proof is a direct adaptation of the proof of Proposition A.15. □

The  block  filter  stability  result  in  Appendix  A  is  the  only  place  in  the  proof  of 
the  variance  bound  where  the  inadequacy  of  the  classical  comparison  theorem  plays 
a  role.  Having  exploited  the  generalized  comparison  Theorem  6.4  to  extend  the 
stability  results  in  Appendix  A  to  the  present  setting,  we  would  therefore  expect  that 
the  remainder  of  the  proof  of  the  variance  bound  follows  verbatim  from  Appendix 
A.  Unfortunately,  however,  there  is  a  complication:  the  result  of  Corollary  C.19 
is  not  as  powerful  as  the  corresponding  result  in  Appendix  A.  Note  that  the  first 
(uniform)  bound  in  Corollary  C.19  decays  exponentially  in  time  n,  but  the  second 
(initial  condition  dependent)  bound  only  decays  in  n  if  it  happens  to  be  the  case 
that $e^{-\beta}\Delta_{\mathcal{K}} < 1$. In Appendix A both the spatial and temporal interactions were
assumed to be sufficiently weak, so we could assume that the latter was always the case.
In the present setting, however, it is possible that $e^{-\beta}\Delta_{\mathcal{K}} \ge 1$ no matter how weak
the spatial correlations are.

To  surmount  this  problem,  we  will  use  a  slightly  different  error  decomposition  than 
was  used  in  Appendix  A  to  complete  the  proof  of  the  variance  bound.  The  present 
approach  is  inspired  by  [  ] .  The  price  we  pay  is  that  the  variance  bound  scales  in  the 
number of samples as $N^{-\gamma}$, where $\gamma$ may be less than the optimal (by the central limit
theorem) rate $\frac{1}{2}$. It is likely that a more sophisticated method of proof would yield
the optimal $N^{-1/2}$ rate in the variance bound. However, let us note that in order to put
the block particle filter to good use we must optimize over the size of the blocks in $\mathcal{K}$,
and optimizing the error bound in Theorem 6.13 yields at best a rate of order $N^{-\alpha}$
for some constant $\alpha$ depending on the constants $\beta_1, \beta_2$. As the proof of Theorem 6.13
is not expected to yield realistic values for the constants $\beta_1, \beta_2$, the suboptimality of
the variance rate $\gamma$ does not significantly alter the practical conclusions that can be
drawn from Theorem 6.13.

We  now  proceed  to  the  variance  bound.  The  following  is  the  main  result  of  this 
section. 

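Theorem C.20 (Variance term). Suppose there exist $0 < \varepsilon, \delta, \kappa < 1$ such that

$$\varepsilon\, q^v(x^v,z^v) \le p^v(x,z^v) \le \varepsilon^{-1} q^v(x^v,z^v),$$

$$\delta \le q^v(x^v,z^v) \le \delta^{-1},$$

$$\kappa \le g^v(x^v,y^v) \le \kappa^{-1}$$

for every $v \in V$, $x, z \in \mathbb{X}$, and $y \in \mathbb{Y}$, where $q^v : \mathbb{X}^v \times \mathbb{X}^v \to \mathbb{R}_+$ is a transition density
with respect to $\psi^v$. Suppose also that we can choose $q \in \mathbb{N}$ and $\beta > 0$ such that

$$c := 3q\Delta^2 e^{\beta q}\bigl(1 - \varepsilon^{2(\Delta+1)}\bigr) + e^{\beta}(1 - \varepsilon^2\delta^2) + e^{\beta q}(1 - \varepsilon^2\delta^2)^q < 1.$$

Then for every $n \ge 0$, $\sigma \in \mathbb{X}$, $K \in \mathcal{K}$ and $J \subseteq K$, the following hold:

1. If $e^{-\beta}\Delta_{\mathcal{K}} < 1$, we have

$$|\!|\!|\tilde\pi_n^\sigma - \hat\pi_n^\sigma|\!|\!|_J \le \mathrm{card}\,J\;\frac{32\Delta_{\mathcal{K}}}{1-c}\;\frac{2 - e^{-\beta}\Delta_{\mathcal{K}}}{1 - e^{-\beta}\Delta_{\mathcal{K}}}\;\frac{(\varepsilon\delta\kappa^{\Delta_{\mathcal{K}}})^{-4|\mathcal{K}|_\infty}}{N^{1/2}}.$$

2. If $e^{-\beta}\Delta_{\mathcal{K}} = 1$, we have

$$|\!|\!|\tilde\pi_n^\sigma - \hat\pi_n^\sigma|\!|\!|_J \le \mathrm{card}\,J\;\frac{16\beta^{-1}\Delta_{\mathcal{K}}}{1-c}\;(\varepsilon\delta\kappa^{\Delta_{\mathcal{K}}})^{-4|\mathcal{K}|_\infty}\;\frac{3 + \log N}{N^{1/2}}.$$

3. If $e^{-\beta}\Delta_{\mathcal{K}} > 1$, we have

$$|\!|\!|\tilde\pi_n^\sigma - \hat\pi_n^\sigma|\!|\!|_J \le \mathrm{card}\,J\;\frac{32\Delta_{\mathcal{K}}}{1-c}\;\left\{\frac{1}{e^{-\beta}\Delta_{\mathcal{K}} - 1} + 2\right\}\frac{(\varepsilon\delta\kappa^{\Delta_{\mathcal{K}}})^{-4|\mathcal{K}|_\infty}}{N^{\beta/(2\log\Delta_{\mathcal{K}})}}.$$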
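The three cases of the theorem differ only in how the bound decays in the number of samples $N$. As a rough illustration (a sketch that is not part of the original argument; the function name and return convention are hypothetical), the exponent $\gamma$ in $N^{-\gamma}$ can be read off from $\beta$ and $\Delta_{\mathcal{K}}$ as follows.

```python
import math

def variance_rate(beta, delta_k):
    """Return (gamma, has_log_factor) for the N**(-gamma) decay in Theorem C.20.

    beta    : the constant beta > 0 from the assumption c < 1
    delta_k : the block interaction constant Delta_K (assumed >= 1)
    """
    r = math.exp(-beta) * delta_k
    if math.isclose(r, 1.0):
        # Case 2: rate N^{-1/2} up to a factor (3 + log N)
        return 0.5, True
    if r < 1.0:
        # Case 1: optimal rate N^{-1/2}
        return 0.5, False
    # Case 3: r > 1 forces delta_k > 1, so the logarithm is positive
    return beta / (2.0 * math.log(delta_k)), False
```

In case 3 the exponent is strictly smaller than $1/2$, because $e^{-\beta}\Delta_{\mathcal{K}} > 1$ means $\beta < \log\Delta_{\mathcal{K}}$.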


The  proof  of  Theorem  C.20  combines  the  stability  bounds  of  Corollary  C.19  and 
one-step bounds on the sampling error, Lemma A.17 and Proposition A.20, that can
be  used  verbatim  in  the  present  setting.  We  recall  the  latter  here  for  the  reader’s 
convenience. 

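Proposition C.21 (Sampling error). Suppose there exist $0 < \varepsilon, \delta, \kappa < 1$ such that

$$\varepsilon\, q^v(x^v,z^v) \le p^v(x,z^v) \le \varepsilon^{-1} q^v(x^v,z^v),$$

$$\delta \le q^v(x^v,z^v) \le \delta^{-1},$$

$$\kappa \le g^v(x^v,y^v) \le \kappa^{-1}$$

for every $v \in V$, $x, z \in \mathbb{X}$, and $y \in \mathbb{Y}$. Then we have

$$\max_{K\in\mathcal{K}} |\!|\!|\tilde F_n\hat\pi_{n-1}^\sigma - \hat F_n\hat\pi_{n-1}^\sigma|\!|\!|_K \le \frac{2\kappa^{-2|\mathcal{K}|_\infty}}{N^{1/2}}$$

and

$$\max_{K\in\mathcal{K}} \mathbf{E}\bigl[\|\tilde F_{s+1}\tilde F_s\hat\pi_{s-1}^\sigma - \tilde F_{s+1}\hat F_s\hat\pi_{s-1}^\sigma\|_K^2\bigr]^{1/2} \le \frac{16\Delta_{\mathcal{K}}\,(\varepsilon\delta)^{-2|\mathcal{K}|_\infty}\,\kappa^{-4|\mathcal{K}|_\infty\Delta_{\mathcal{K}}}}{N^{1/2}}$$

for every $0 < s < n$ and $\sigma \in \mathbb{X}$.

Proof. Immediate from Lemma A.17 and Proposition A.20 upon replacing $\varepsilon$ by $\varepsilon\delta$. □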
We  can  now  prove  Theorem  C.20. 

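Proof of Theorem C.20. We fix for the time being an integer $t \ge 1$ (we will optimize
over $t$ at the end of the proof). We argue differently when $n \le t$ and when $n > t$.
Suppose first that $n \le t$. In this case, we estimate

$$|\!|\!|\tilde\pi_n^\sigma - \hat\pi_n^\sigma|\!|\!|_J = |\!|\!|\tilde F_n\cdots\tilde F_1\delta_\sigma - \hat F_n\cdots\hat F_1\delta_\sigma|\!|\!|_J \le \sum_{k=1}^{n} |\!|\!|\tilde F_n\cdots\tilde F_{k+1}\tilde F_k\hat\pi_{k-1}^\sigma - \tilde F_n\cdots\tilde F_{k+1}\hat F_k\hat\pi_{k-1}^\sigma|\!|\!|_J$$

using a telescoping sum and the triangle inequality. The term $k = n$ in the sum
is estimated by the first bound in Proposition C.21, while the remaining terms are
estimated by the second bound of Corollary C.19 and Proposition C.21, respectively.
This yields

$$|\!|\!|\tilde\pi_n^\sigma - \hat\pi_n^\sigma|\!|\!|_J \le \mathrm{card}\,J\;\frac{32\Delta_{\mathcal{K}}}{1-c}\;\frac{(\varepsilon\delta\kappa^{\Delta_{\mathcal{K}}})^{-4|\mathcal{K}|_\infty}}{N^{1/2}}\left\{\frac{(e^{-\beta}\Delta_{\mathcal{K}})^{n-1} - 1}{e^{-\beta}\Delta_{\mathcal{K}} - 1} + 1\right\}$$

(in the case $e^{-\beta}\Delta_{\mathcal{K}} = 1$, the quantity between the brackets $\{\,\cdot\,\}$ equals $n$).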
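The bracketed quantity is a geometric sum: the terms $k = 1, \ldots, n-1$ contribute $(e^{-\beta}\Delta_{\mathcal{K}})^{n-k-1}$ each, and the term $k = n$ contributes $1$. The following sketch (with hypothetical helper names, and writing `r` for $e^{-\beta}\Delta_{\mathcal{K}}$) compares the explicit sum with the closed form used above, including the degenerate case `r = 1`.

```python
def bracket_explicit(r, n):
    """sum_{k=1}^{n-1} r**(n-k-1) + 1, i.e. the telescoping terms before simplification."""
    return sum(r ** (n - k - 1) for k in range(1, n)) + 1.0

def bracket_closed(r, n):
    """Closed form {(r**(n-1) - 1)/(r - 1) + 1}, with the convention that it equals n at r = 1."""
    if r == 1.0:
        return float(n)
    return (r ** (n - 1) - 1.0) / (r - 1.0) + 1.0
```

For `r < 1` the bracket stays below `(2 - r) / (1 - r)` for every `n`, which is the uniform-in-time constant appearing in the first case of the theorem.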
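Now suppose that $n > t$. Then we decompose the error as

$$|\!|\!|\tilde\pi_n^\sigma - \hat\pi_n^\sigma|\!|\!|_J \le |\!|\!|\tilde F_n\cdots\tilde F_{n-t+1}\tilde\pi_{n-t}^\sigma - \tilde F_n\cdots\tilde F_{n-t+1}\hat\pi_{n-t}^\sigma|\!|\!|_J + \sum_{k=n-t+1}^{n} |\!|\!|\tilde F_n\cdots\tilde F_{k+1}\tilde F_k\hat\pi_{k-1}^\sigma - \tilde F_n\cdots\tilde F_{k+1}\hat F_k\hat\pi_{k-1}^\sigma|\!|\!|_J,$$

that is, we develop the telescoping sum for $t$ steps only. The first term is estimated
by the first bound in Corollary C.19, while the sum is estimated as in the case $n \le t$.
This yields

$$|\!|\!|\tilde\pi_n^\sigma - \hat\pi_n^\sigma|\!|\!|_J \le \frac{\mathrm{card}\,J}{1-c}\left[2e^{-\beta t} + \frac{32\Delta_{\mathcal{K}}(\varepsilon\delta\kappa^{\Delta_{\mathcal{K}}})^{-4|\mathcal{K}|_\infty}}{N^{1/2}}\left\{\frac{(e^{-\beta}\Delta_{\mathcal{K}})^{t-1} - 1}{e^{-\beta}\Delta_{\mathcal{K}} - 1} + 1\right\}\right]$$

(in the case $e^{-\beta}\Delta_{\mathcal{K}} = 1$, the quantity between the brackets $\{\,\cdot\,\}$ equals $t$).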

We  now  consider  separately  the  three  cases  in  the  statement  of  the  Theorem. 
Case 1. In this case we choose $t = n$, and note that

$$\frac{(e^{-\beta}\Delta_{\mathcal{K}})^{n-1} - 1}{e^{-\beta}\Delta_{\mathcal{K}} - 1} + 1 \le \frac{2 - e^{-\beta}\Delta_{\mathcal{K}}}{1 - e^{-\beta}\Delta_{\mathcal{K}}} \quad\text{for all } n \ge 1.$$

Thus the result follows from the first bound above.
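Case 2. In this case we have

$$|\!|\!|\tilde\pi_n^\sigma - \hat\pi_n^\sigma|\!|\!|_J \le \frac{\mathrm{card}\,J}{1-c}\left[2e^{-\beta t} + \frac{32\Delta_{\mathcal{K}}(\varepsilon\delta\kappa^{\Delta_{\mathcal{K}}})^{-4|\mathcal{K}|_\infty}}{N^{1/2}}\; t\right]$$

for all $t, n \ge 1$. Now choose $t = \lceil (2\beta)^{-1}\log N\rceil$. Then

$$|\!|\!|\tilde\pi_n^\sigma - \hat\pi_n^\sigma|\!|\!|_J \le \frac{\mathrm{card}\,J}{1-c}\left[16\beta^{-1}\Delta_{\mathcal{K}}(\varepsilon\delta\kappa^{\Delta_{\mathcal{K}}})^{-4|\mathcal{K}|_\infty}\,\frac{\log N}{N^{1/2}} + \frac{34\Delta_{\mathcal{K}}(\varepsilon\delta\kappa^{\Delta_{\mathcal{K}}})^{-4|\mathcal{K}|_\infty}}{N^{1/2}}\right],$$

which readily yields the desired bound.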
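Case 3. In this case we have

$$|\!|\!|\tilde\pi_n^\sigma - \hat\pi_n^\sigma|\!|\!|_J \le \frac{\mathrm{card}\,J}{1-c}\left[2e^{-\beta t} + \frac{32\Delta_{\mathcal{K}}(\varepsilon\delta\kappa^{\Delta_{\mathcal{K}}})^{-4|\mathcal{K}|_\infty}}{N^{1/2}}\left\{\frac{(e^{-\beta}\Delta_{\mathcal{K}})^{t-1} - 1}{e^{-\beta}\Delta_{\mathcal{K}} - 1} + 1\right\}\right]$$

for all $t, n \ge 1$. Now choose $t = \left\lceil \dfrac{\log N}{2\log\Delta_{\mathcal{K}}} \right\rceil$. Then

$$|\!|\!|\tilde\pi_n^\sigma - \hat\pi_n^\sigma|\!|\!|_J \le \mathrm{card}\,J\;\frac{32\Delta_{\mathcal{K}}}{1-c}\left\{\frac{1}{e^{-\beta}\Delta_{\mathcal{K}} - 1} + 2\right\}\frac{(\varepsilon\delta\kappa^{\Delta_{\mathcal{K}}})^{-4|\mathcal{K}|_\infty}}{N^{\beta/(2\log\Delta_{\mathcal{K}})}},$$

and the proof is complete. □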
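The choice of $t$ in the third case balances the stability term $2e^{-\beta t}$ against the growth of the truncated telescoping sum. The sketch below checks this trade-off numerically; it is an illustration under stated assumptions, not part of the proof. The common prefactor $\mathrm{card}\,J/(1-c)$ is dropped, the name `a` stands for $32\Delta_{\mathcal{K}}(\varepsilon\delta\kappa^{\Delta_{\mathcal{K}}})^{-4|\mathcal{K}|_\infty}$ (so `a >= 2`), and all function names are hypothetical.

```python
import math

def truncated_bound(t, n_samples, beta, delta_k, a):
    """2 e^{-beta t} + a N^{-1/2} {(r^{t-1} - 1)/(r - 1) + 1} with r = e^{-beta} delta_k != 1."""
    r = math.exp(-beta) * delta_k
    brace = (r ** (t - 1) - 1.0) / (r - 1.0) + 1.0
    return 2.0 * math.exp(-beta * t) + a * brace / math.sqrt(n_samples)

def case3_t(n_samples, delta_k):
    """Truncation length t = ceil(log N / (2 log Delta_K)) used in Case 3."""
    return math.ceil(math.log(n_samples) / (2.0 * math.log(delta_k)))

def case3_bound(n_samples, beta, delta_k, a):
    """Right-hand side a {1/(r - 1) + 2} N^{-beta/(2 log Delta_K)} of Case 3 (requires r > 1)."""
    r = math.exp(-beta) * delta_k
    gamma = beta / (2.0 * math.log(delta_k))
    return a * (1.0 / (r - 1.0) + 2.0) / n_samples ** gamma
```

For example, with `beta = 0.2`, `delta_k = 3` and `a = 32`, evaluating `truncated_bound(case3_t(N, 3), N, 0.2, 3, 32)` for several sample sizes `N >= 2` stays below `case3_bound(N, 0.2, 3, 32)`.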


The  conclusion  of  Theorem  6.13  now  follows  readily  from  Theorems  C.14  and  C.20. 
We must only check that the assumptions of Theorems C.14 and C.20 are satisfied. The
assumption of Theorem C.14 is slightly stronger than that of Theorem C.20, so it
suffices to consider the former. To this end, fix $0 < \delta < 1$, and choose $q \in \mathbb{N}$ such
that

$$1 - \delta^2 + (1 - \delta^2)^q < 1.$$

Then we may evidently choose $0 < \varepsilon_0 < 1$, depending on $\delta$ and $\Delta$ only, such that

$$3q\Delta^2\bigl(1 - \varepsilon^{2(\Delta+1)}\bigr) + 1 - \varepsilon^2\delta^2 + (1 - \varepsilon^2\delta^2)^q < 1$$

for all $\varepsilon_0 < \varepsilon \le 1$. This is the constant $\varepsilon_0$ that appears in the statement of Theorem
6.13. Finally, it is now clear that we can choose $\beta > 0$ sufficiently close to zero
(depending on $\delta, \varepsilon, r, \Delta$ only) such that $c < 1$. Thus the proof of Theorem 6.13 is
complete.
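The two choices made in this paragraph can be carried out numerically. The following sketch is only an illustration with hypothetical function names: it assumes the $\beta = 0$ form of the constant, $3q\Delta^2(1 - \varepsilon^{2(\Delta+1)}) + 1 - \varepsilon^2\delta^2 + (1 - \varepsilon^2\delta^2)^q$, picks the smallest admissible $q$ (the argument allows any admissible $q$), and then locates a threshold $\varepsilon_0$ by bisection, using that the expression is decreasing in $\varepsilon$.

```python
def smallest_q(delta):
    """Smallest q with 1 - delta^2 + (1 - delta^2)^q < 1, i.e. (1 - delta^2)^q < delta^2."""
    q = 1
    while (1.0 - delta ** 2) ** q >= delta ** 2:
        q += 1
    return q

def c_at_beta_zero(eps, delta, q, big_delta):
    """The constant c evaluated at beta = 0 (assumed form, see lead-in)."""
    s = 1.0 - (eps * delta) ** 2
    return 3 * q * big_delta ** 2 * (1.0 - eps ** (2 * (big_delta + 1))) + s + s ** q

def epsilon_zero(delta, big_delta, tol=1e-12):
    """Bisection for a threshold eps0 such that c_at_beta_zero(eps) < 1 for eps0 <= eps <= 1."""
    q = smallest_q(delta)
    lo, hi = 0.0, 1.0  # invariant: c(lo) >= 1 and c(hi) < 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if c_at_beta_zero(mid, delta, q, big_delta) < 1.0:
            hi = mid
        else:
            lo = mid
    return q, hi
```

For instance, `epsilon_zero(0.5, 2)` first selects `q = 5`, since $0.75^5 < 0.25 \le 0.75^4$. The resulting threshold is close to one, which reflects that the argument requires weak spatial interactions.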
Appendix D


Nonlinear  filtering  in  infinite 
dimension:  proofs 


This  appendix  is  devoted  to  the  proof  of  Theorem  7.7.  The  proof  relies  on  standard 
tools  from  statistical  mechanics  [7,  77]:  a  Peierls  argument  for  the  low  noise  regime 
and  a  Dobrushin  contraction  method  for  the  high  noise  regime. 

D.1 Proof of Theorem 7.7: low noise

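We begin by noting that as $(X_k^v, Y_k^v)_{k,v\in\mathbb{Z}}$ and $(-X_k^v, Y_k^v)_{k,v\in\mathbb{Z}}$ have the same law, it
follows that $\mathbf{E}(X_k^0 | Y_1, \ldots, Y_k) = \mathbf{E}(-X_k^0 | Y_1, \ldots, Y_k)$, and we therefore have

$$\mathbf{E}(X_k^0 | Y_1, \ldots, Y_k) = 0 \quad\text{for all } k \ge 1.$$

To prove that the filter is not stable, it therefore suffices to show that

$$\inf_{k\ge 1} \mathbf{E}\,\bigl|\mathbf{E}(X_k^0 | X_0, Y_1, \ldots, Y_k)\bigr| > 0.$$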

To  show  this,  we  begin  by  reducing  the  problem  to  finite  dimension. 

Lemma D.1. Suppose that $0 < p < 1/2$. Then


Proof. Let $\beta := \log\sqrt{(1-p)/p} > 0$. We begin by noting that


for $1 \le \ell \le k$ and $y \in \{-1, 1\}$. Define the probability measure $\mathbf{Q}$ such that


i=i 
Then under Q, the observations Y™ and 1 < £ < k are symmetric Bernoulli

and  independent  from  all  the  remaining  variables  in  the  model,  while  the  remainder 
of  the  model  is  the  same  as  defined  above.  In  particular,  this  implies  that 

{A"o,  Xg,  Yfv  :!<£<  k,  |u|  >  m}  X  {X£,  X/,  Y/  :  1  <  £  <  k,\v\  <  m}  under  Q. 

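We therefore obtain using the Bayes formula (Theorem 2.7)

$$\mathbf{P}(A | X_0, Y_1, \ldots, Y_k, \{X_1^v, \ldots, X_k^v : |v| > m\}) = \frac{\mathbf{E}_{\mathbf{Q}}(\mathbf{1}_A\, g | X_0, Y_1, \ldots, Y_k, \{X_1^v, \ldots, X_k^v : |v| > m\})}{\mathbf{E}_{\mathbf{Q}}(g | X_0, Y_1, \ldots, Y_k, \{X_1^v, \ldots, X_k^v : |v| > m\})}$$

$$\ge e^{-4\beta k}\, \mathbf{Q}(A | X_0, Y_1, \ldots, Y_k, \{X_1^v, \ldots, X_k^v : |v| > m\}) = e^{-4\beta k}\, \mathbf{Q}(A | X_0, Y_1, \ldots, Y_k)$$

for any $A \in \sigma\{X_0, Y_1, \ldots, Y_k, X_1^v, \ldots, X_k^v : |v| \le m\}$.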

Define $Z^0 := (X_1^0, \ldots, X_k^0)$ and $Z^{-m} := (X_1^m, \ldots, X_k^m, X_1^{-m}, \ldots, X_k^{-m})$ for $m \ge 1$.
Due to the conditional independence structure of the infinite-dimensional filtering
model,

$$\mathbf{E}\bigl(f(Z^{-m}) | X_0, Y_1, \ldots, Y_k, Z^{-m-1}, Z^{-m-2}, \ldots\bigr) = \mathbf{E}\bigl(f(Z^{-m}) | X_0, Y_1, \ldots, Y_k, Z^{-m-1}\bigr)$$

for every $m \ge 0$. Thus $(Z^m)_{m\le 0}$ is a Markov chain under any regular version of
the conditional distribution $\mathbf{P}(\,\cdot\, | X_0, Y_1, \ldots, Y_k)$ (almost surely with respect to the
realization of $X_0, Y_1, \ldots, Y_k$). Moreover, the above estimate shows that the (random)
transition kernels of this Markov chain satisfy the Doeblin condition [38, Theorem
16.2.4], so

$$\bigl|\mathbf{E}(X_k^0 | X_0, Y_1, \ldots, Y_k, \{X_1^v, \ldots, X_k^v : |v| > m\}) - \mathbf{E}(X_k^0 | X_0, Y_1, \ldots, Y_k)\bigr| \le 2\,(1 - e^{-4\beta k})^{m+1}$$

for all $m \ge 0$. This completes the proof. □
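The Doeblin bound $2(1 - e^{-4\beta k})^{m+1}$ decays geometrically in $m$ for fixed $k$, but the contraction rate degrades quickly as $k$ grows. As a small numerical illustration (a sketch with hypothetical function names, not part of the proof), one can compute how many additional sites $m$ must be conditioned on to reach a given tolerance.

```python
import math

def doeblin_gap(beta, k, m):
    """The bound 2 (1 - e^{-4 beta k})^{m+1} from the end of the proof of Lemma D.1."""
    return 2.0 * (1.0 - math.exp(-4.0 * beta * k)) ** (m + 1)

def sites_needed(beta, k, tol):
    """Smallest m >= 0 with doeblin_gap(beta, k, m) <= tol (up to floating-point rounding)."""
    rho = 1.0 - math.exp(-4.0 * beta * k)
    if rho == 0.0:
        return 0
    # 2 rho^{m+1} <= tol  <=>  m + 1 >= log(tol / 2) / log(rho), since log(rho) < 0
    return max(0, math.ceil(math.log(tol / 2.0) / math.log(rho)) - 1)
```

Because $1 - e^{-4\beta k}$ approaches one exponentially fast in $k$, the required $m$ grows roughly like $e^{4\beta k}$. The lemma therefore gives a reduction to finite dimension for each fixed $k$, not a bound that is uniform in $k$.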

Lemma D.1 reduces our problem to a finite-dimensional one. Indeed, it is clear
that the filter is not stable for $p = 0$ (for precisely the same reason as in Example
7.1), so we will assume without loss of generality in the sequel that $0 < p < 1/2$.
Applying Lemma D.1, it follows that in order to prove that the filter is not stable, it
suffices to show that

$$\inf_{k,m\ge 1} \mathbf{E}\,\bigl|\mathbf{E}(X_k^0 | X_0, Y_1, \ldots, Y_k, \{X_1^v, \ldots, X_k^v : |v| > m\})\bigr| > 0.$$

But the conditional independence structure of the infinite-dimensional filtering model
implies that the conditional expectation inside this expression depends only on $X_\ell^v$
and $Y_\ell^v$ for $0 \le \ell \le k$ and $|v| \le m + 1$. We are thus faced with the problem of
obtaining a lower bound on this finite-dimensional quantity that is uniform in $k, m$.

To lighten the notation, it will be convenient to view $(X_k^v)_{k,v\in\mathbb{Z}}$ not as a sequence
of spatial random fields on $\mathbb{Z}$, but rather as a single space-time random field on $\mathbb{Z}^2$.
To this end, we will write $X^q := X_k^v$ for $q = (k,v) \in \mathbb{Z}^2$. We will similarly write
Yqr  ■=  Y”  and  fqr  :=  if  q  —  (k  —  l,v)  and  r  =  (k,v),  and  Yqr  :=  Yk  and  £ qr  := 
if  q  —  (k,v)  and  r  =  (k,v  +  1)  (the  order  of  the  indices  q,r  is  irrelevant,  that  is, 
$Y^{qr} := Y^{rq}$ etc.) In this manner, we can view $X = (X^q)_{q\in\mathbb{Z}^2}$ as a random field on the
lattice $\mathbb{Z}^2$, with observations $Y^{qr}$ attached to each edge $\{q, r\} \subset \mathbb{Z}^2$ with $\|q - r\| = 1$.

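Lemma D.2. Suppose that $0 < p < 1/2$, and let $k, m \ge 1$. Define the quantities
$\beta := \log\sqrt{(1-p)/p}$, $J := [1,k] \times [-m,m]$, and $\partial J := \{0\} \times [-m,m] \cup [1,k] \times \{-m-1, m+1\}$.
For any given configuration $x \in \{-1,1\}^{\mathbb{Z}^2}$, we define the random measure
$\Sigma^x$ on $\{-1,1\}^J$ as

$$\Sigma^x(\{z\}) = \frac{1}{Z}\exp\Biggl(\beta \sum_{\{q,r\}\subseteq J:\,\|q-r\|=1} x^q x^r \xi^{qr} z^q z^r + \beta \sum_{q\in J,\, r\in\partial J:\,\|q-r\|=1} x^q \xi^{qr} z^q\Biggr),$$

where $Z$ is the normalization such that $\Sigma^x(\{-1,1\}^J) = 1$. Then

$$\mathbf{P}\bigl((X^q)_{q\in J} \in A \,\big|\, X_0, Y_1, \ldots, Y_k, \{X_1^v, \ldots, X_k^v : |v| > m\}\bigr) = \Sigma^X(A).$$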

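Proof. By the conditional independence structure of the filtering model, we have

$$\mathbf{P}\bigl((X^q)_{q\in J} \in A \,\big|\, X_0, Y_1, \ldots, Y_k, \{X_1^v, \ldots, X_k^v : |v| > m\}\bigr) = \mathbf{P}\bigl((X^q)_{q\in J} \in A \,\big|\, (X^q)_{q\in\partial J},\, (Y^{qr})_{q\in J,\, r\in J\cup\partial J,\, \|q-r\|=1}\bigr).$$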

The  joint  distribution  of  the  random  variables  that  appear  in  this  expression  is 

P((A'»)J£Ju8J  =  USJ, 11,-11-1  =  y)  =  2-|JuSJ|  x 

n  n 

{<1X}QJ'-\\<1— r[|=l  q£J,r&8J:\\q- r||  =  l 


where $|A|$ denotes the cardinality of a set $A$. The result now follows readily from the
Bayes formula (Theorem 2.7) and the fact that $Y^{qr} = X^q X^r \xi^{qr}$ by construction. □

Lemma D.2 shows that the distribution $\mathbf{P}(\,\cdot\, | X_0, Y_1, \ldots, Y_k, \{X_1^v, \ldots, X_k^v : |v| > m\})$
has a familiar form in statistical mechanics: it is (up to the change of variables or
gauge transformation $\sigma^q = x^q z^q$) an Ising model with random interactions, also known
as a random bond Ising model or an Ising spin glass, with inverse temperature $\beta = \log\sqrt{(1-p)/p}$.
The failure of stability of the filter for large $\beta$ can now be addressed
using a standard method in statistical mechanics [7, section 6.4]. For concreteness,
we include the requisite arguments in the present setting, which completes the proof.
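To make the identification of $\beta$ with an inverse temperature concrete, the following sketch converts between the noise level $p$ and $\beta = \log\sqrt{(1-p)/p}$. It assumes, as the definition of $\beta$ suggests, that $p$ is the probability that a single edge observation has its sign flipped; the function names are hypothetical.

```python
import math

def inverse_temperature(p):
    """beta = log sqrt((1 - p) / p) for a flip probability 0 < p <= 1/2."""
    return 0.5 * math.log((1.0 - p) / p)

def unflipped_probability(beta):
    """Gibbs weight e^beta / (e^beta + e^{-beta}) of an edge whose sign is not flipped."""
    return math.exp(beta) / (math.exp(beta) + math.exp(-beta))
```

With this parametrization `unflipped_probability(inverse_temperature(p))` recovers `1 - p`. Low noise ($p \to 0$) corresponds to low temperature ($\beta \to \infty$), and $p = 1/2$ corresponds to $\beta = 0$.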

Proposition D.3. There exists an absolute constant $0 < p^* < 1/2$ such that

$$\mathbf{E}\,\bigl|\mathbf{E}(X_k^0 | X_0, Y_1, \ldots, Y_k, \{X_1^v, \ldots, X_k^v : |v| > m\})\bigr| \ge \frac{1}{4}$$

for every $k, m \ge 1$ whenever $0 < p < p^*$.
Proof. Let us fix $k, m \ge 1$ throughout the proof, and define $0 := (k,0) \in J$. We will
prove below the following claim: there exists an absolute constant $0 < p^* < 1/2$ such
that

$$\mathbf{P}\Bigl[\Sigma^X(\{z : z^0 = X^0\}) \ge \tfrac{3}{4}\Bigr] \ge \tfrac{1}{2}$$

whenever $0 < p < p^*$: that is, when the noise is sufficiently small, the conditional
distribution $\mathbf{P}(X_k^0 \in \,\cdot\, | X_0, Y_1, \ldots, Y_k, \{X_1^v, \ldots, X_k^v : |v| > m\})$ assigns a large
probability to the actually realized value of $X_k^0$ at least half of the time (recall Lemma D.2).
Let us complete the proof assuming this claim. Note that $\Sigma^x(\{z : z^0 = x^0\}) \ge 3/4$
implies $|\Sigma^x(\{z : z^0 = 1\}) - \Sigma^x(\{z : z^0 = -1\})| \ge 1/2$. Thus the above claim implies
that

$$\mathbf{P}\Bigl[\bigl|\mathbf{E}(X_k^0 | X_0, Y_1, \ldots, Y_k, \{X_1^v, \ldots, X_k^v : |v| > m\})\bigr| \ge \tfrac{1}{2}\Bigr] \ge \tfrac{1}{2},$$

where we have used Lemma D.2 and the fact that $\{X^q\}$ and $\{\xi^{qr}\}$ are independent.

The proof is now completed by a straightforward estimate.

It remains to prove the above claim. To this end, we use a Peierls argument. Fix
for the time being a configuration $z \in \{-1,1\}^J$. For any $J' \subseteq J$, define the boundary
edges

$$\mathcal{E}J' := \{\{q,r\} : q \in J',\ r \in (J\setminus J') \cup \partial J,\ \|q - r\| = 1\}.$$

A subset $J' \subseteq J$ is called a contour if it is simply connected, $z^q = -x^q$ for all
$\{q,r\} \in \mathcal{E}J'$ with $q \in J'$, and $z^r = x^r$ if in addition $r \in J\setminus J'$. We will denote the
set of contours as $\mathcal{C}_{z,x}$ (note that the definition of a contour depends on the given
configurations $z$ and $x$). If $z^0 = -x^0$, then there must exist a contour $J' \in \mathcal{C}_{z,x}$ such
that $0 \in J'$: indeed, construct $J'$ by choosing the maximal connected subset of $J$
such that $0 \in J'$ and $z^q = -x^q$ for all $q \in J'$, and then “fill in the holes” to make $J'$
simply connected. Thus

$$\Sigma^x(\{z : z^0 = -x^0\}) \le \Sigma^x(\{z : \exists\, J' \in \mathcal{C}_{z,x},\ 0 \in J'\}) \le \sum_{J'\ni 0} \Sigma^x(\{z : J' \in \mathcal{C}_{z,x}\}).$$

Now note that, by the definition of a contour, $x^q z^q = -1$ whenever $\{q,r\} \in \mathcal{E}J'$ with
$q \in J'$, and $x^q x^r z^q z^r = -1$ if in addition $r \in J\setminus J'$. Thus the existence of a contour
implies the presence of many such edges. The basic idea of the proof is that the
probability that this occurs is small under $\Sigma^x$ due to Lemma D.2. Let us make this
precise.


Lemma D.4. For any $J' \subseteq J$, we have

$$\Sigma^x(\{z : J' \in \mathcal{C}_{z,x}\}) \le \exp\Biggl(-2\beta \sum_{\{q,r\}\in\mathcal{E}J'} \xi^{qr}\Biggr).$$

Proof. Assume without loss of generality that $J'$ is simply connected. Let us use for
simplicity the convention that $z^r = x^r$ for $r \in \partial J$. Define the events

$$A = \{z : z^q = -x^q \text{ and } z^r = x^r \text{ for } \{q,r\} \in \mathcal{E}J',\ q \in J'\},$$
$$B = \{z : z^q = x^q \text{ and } z^r = x^r \text{ for } \{q,r\} \in \mathcal{E}J',\ q \in J'\}.$$

Then we evidently have by Lemma D.2

$$\Sigma^x(\{z : J' \in \mathcal{C}_{z,x}\}) = \Sigma^x(A) \le \frac{\Sigma^x(A)}{\Sigma^x(B)}.$$

An elementary computation shows that

$$\frac{\Sigma^x(A)}{\Sigma^x(B)} = \exp\Biggl(-2\beta \sum_{\{q,r\}\in\mathcal{E}J'} \xi^{qr}\Biggr)\, \frac{\sum_z \mathbf{1}_A(z) \exp\bigl(\beta \sum_{\{q,r\}\subseteq J':\,\|q-r\|=1} x^q x^r \xi^{qr} z^q z^r\bigr)}{\sum_z \mathbf{1}_B(z) \exp\bigl(\beta \sum_{\{q,r\}\subseteq J':\,\|q-r\|=1} x^q x^r \xi^{qr} z^q z^r\bigr)}.$$

But the ratio in this expression is unity, as the exponential term inside the sums is
invariant under the transformation $z^q \mapsto -z^q$ for all $q \in J'$. The proof is complete. □

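Lemma D.4 allows us to estimate

$$\mathbf{P}\Biggl[\Sigma^X(\{z : z^0 = -X^0\}) \ge \sum_{J'\ni 0 \text{ simply connected}} e^{-\beta|\mathcal{E}J'|}\Biggr]$$

$$\le \mathbf{P}\Biggl[\sum_{J'\ni 0 \text{ simp. conn.}} \exp\Biggl(-2\beta \sum_{\{q,r\}\in\mathcal{E}J'} \xi^{qr}\Biggr) \ge \sum_{J'\ni 0 \text{ simp. conn.}} e^{-\beta|\mathcal{E}J'|}\Biggr]$$

$$\le \mathbf{P}\Biggl[\exists\, J' \ni 0 \text{ simply connected with } \sum_{\{q,r\}\in\mathcal{E}J'} \xi^{qr} \le \frac{|\mathcal{E}J'|}{2}\Biggr]$$

$$\le \sum_{J'\ni 0 \text{ simply connected}} \mathbf{P}\Biggl[\sum_{\{q,r\}\in\mathcal{E}J'} \xi^{qr} \le \frac{|\mathcal{E}J'|}{2}\Biggr].$$

Using a standard combinatorial result [7, Lemma 6.13]

$$|\{J' \subseteq J \text{ simply connected} : 0 \in J',\ |\mathcal{E}J'| = l\}| \le l\,3^{l-1}$$

as well as the simple bound

$$\mathbf{P}\Biggl[\sum_{\{q,r\}\in\mathcal{E}J'} \xi^{qr} \le \frac{|\mathcal{E}J'|}{2}\Biggr] = \mathbf{P}\Bigl[\mathrm{Bin}(|\mathcal{E}J'|,\, 1-p) \le \tfrac{3}{4}|\mathcal{E}J'|\Bigr] \le 2^{|\mathcal{E}J'|}\, p^{|\mathcal{E}J'|/4},$$

we can conclude that

$$\mathbf{P}\bigl[\Sigma^X(\{z : z^0 = -X^0\}) \ge c_1\bigr] \le c_2, \qquad c_1 = \sum_{l=3}^{\infty} l\,3^{l-1}\Bigl(\frac{p}{1-p}\Bigr)^{l/2}, \qquad c_2 = \sum_{l=3}^{\infty} l\,3^{l-1}\, 2^l\, p^{l/4}.$$

But we can now evidently choose $p^* > 0$ sufficiently small such that $c_1 \le 1/4$ and
$c_2 \le 1/2$ whenever $p < p^*$, which readily yields the desired estimate. □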
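Both constants are series of the form $\sum_{l\ge 3} l\,3^{l-1}a^l$, which converge exactly when $3a < 1$ and then equal $a\,[(1-3a)^{-2} - 1 - 6a]$. The sketch below evaluates them, assuming the forms $c_1 = \sum_{l\ge 3} l\,3^{l-1}(p/(1-p))^{l/2}$ and $c_2 = \sum_{l\ge 3} l\,3^{l-1}2^l p^{l/4}$; the function names are hypothetical and the code is an illustration only.

```python
import math

def tail_series(a):
    """sum_{l >= 3} l * 3**(l-1) * a**l, which is finite if and only if 3a < 1."""
    x = 3.0 * a
    if x >= 1.0:
        return math.inf
    # a * sum_{l >= 3} l x^{l-1} = a * (1/(1-x)^2 - 1 - 2x)
    return a * (1.0 / (1.0 - x) ** 2 - 1.0 - 2.0 * x)

def peierls_constants(p):
    """Return (c1, c2) for a flip probability 0 < p < 1/2 under the assumed forms."""
    c1 = tail_series(math.sqrt(p / (1.0 - p)))
    c2 = tail_series(2.0 * p ** 0.25)
    return c1, c2
```

The constraint on $c_2$ is the binding one: its series converges only when $6p^{1/4} < 1$, so any threshold $p^*$ obtained this way is far smaller than $1/2$. This is consistent with the text, which only asserts that some absolute constant $p^* > 0$ exists.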
D.2 Proof of Theorem 7.7: high noise

We  now  turn  to  proving  that  the  filter  is  stable  when  the  noise  is  strong.  We  begin  by 
noting that it suffices to prove stability of finite-dimensional marginals of the filter.

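Lemma D.5. Suppose that

$$\mathbf{E}\,\bigl|\mathbf{E}(f(X_k^{-m}, \ldots, X_k^m) | X_0, Y_1, \ldots, Y_k) - \mathbf{E}(f(X_k^{-m}, \ldots, X_k^m) | Y_1, \ldots, Y_k)\bigr| \xrightarrow{k\to\infty} 0$$

for every function $f$ and every $m \ge 1$. Then the filter is stable.

Proof. Fix any measurable subset $A$ of $\{-1,1\}^{\mathbb{Z}}$ and define

$$F_m = f_m(X_0^{-m}, \ldots, X_0^m) := \mathbf{P}(X_0 \in A | X_0^{-m}, \ldots, X_0^m).$$

We can estimate

$$\mathbf{E}\,|\mathbf{P}(X_k \in A | X_0, Y_1, \ldots, Y_k) - \mathbf{P}(X_k \in A | Y_1, \ldots, Y_k)|$$
$$\le 2\,\mathbf{E}\,|f_m(X_k^{-m}, \ldots, X_k^m) - \mathbf{1}_A(X_k)|$$
$$\quad + \mathbf{E}\,\bigl|\mathbf{E}(f_m(X_k^{-m}, \ldots, X_k^m) | X_0, Y_1, \ldots, Y_k) - \mathbf{E}(f_m(X_k^{-m}, \ldots, X_k^m) | Y_1, \ldots, Y_k)\bigr|.$$

By stationarity the first term does not depend on $k$, and the assumption gives

$$\limsup_{k\to\infty} \mathbf{E}\,|\mathbf{P}(X_k \in A | X_0, Y_1, \ldots, Y_k) - \mathbf{P}(X_k \in A | Y_1, \ldots, Y_k)| \le 2\,\mathbf{E}\,|F_m - \mathbf{1}_A(X_0)|.$$

Letting $m \to \infty$ and using the martingale convergence theorem concludes the proof. □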

We  will  in  fact  prove  a  much  stronger  pathwise  bound  than  is  required  by  the 
above  lemma.  The  basic  tool  we  will  use  for  this  purpose  is  the  Dobrushin  comparison 
theorem  (Theorem  2.11),  which  we  state  here  in  a  convenient  form. 

Theorem D.6 (Dobrushin comparison theorem). Let $\mu$ and $\nu$ be probability measures
on $\{-1,1\}^I$ for some countable set $I$, and choose measurable functions $\mu_i, \nu_i$ such
that

$$\mu_i(X) = \mu(X^i = 1 | \{X^j : j \ne i\}), \qquad \nu_i(X) = \nu(X^i = 1 | \{X^j : j \ne i\}).$$

Define

$$b_i := \sup_x |\mu_i(x) - \nu_i(x)|, \qquad C_{ji} := \sup_{x,z:\, x^v = z^v \text{ for } v \ne i} |\mu_j(x) - \mu_j(z)|,$$

and assume that

$$\sup_{j\in I} \sum_{i\in I} C_{ji} < 1.$$

Then $D := \sum_{n=0}^{\infty} C^n$ exists (in the sense of matrix algebra), and

$$|\mu f - \nu f| \le \sum_{j\in J} \sum_{i\in I} D_{ji}\, b_i$$

whenever $J$ is a finite set, $f(x)$ depends only on $\{x^j : j \in J\}$, and $0 \le f \le 1$.
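For a finite index set the matrix $D$ is simply the Neumann series $\sum_{n\ge 0} C^n = (I - C)^{-1}$, and the comparison bound is a finite double sum. The following pure-Python sketch (hypothetical helper names; a finite truncation of the series stands in for the infinite sum) illustrates the computation on a small influence matrix.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def neumann_series(c, terms=200):
    """Approximate D = sum_{n >= 0} C^n by its first terms + 1 summands."""
    n = len(c)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total, power = identity, identity
    for _ in range(terms):
        power = mat_mul(power, c)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total

def comparison_bound(d, b, sites):
    """sum_{j in sites} sum_i D[j][i] * b[i], the right-hand side of the comparison theorem."""
    return sum(d[j][i] * b[i] for j in sites for i in range(len(b)))
```

For the matrix `[[0, 0.25], [0.25, 0]]`, whose row sums are below one, the series converges to `(1/0.9375) * [[1, 0.25], [0.25, 1]]`. A discrepancy `b = [1, 0]` at site 0 then contributes only `4/15` to the bound for a function of site 1.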
We will apply this result pathwise to compare the filters with and without
conditioning on the initial condition. To this end, we must compute the quantities that
arise in the Dobrushin comparison theorem for suitably chosen regular conditional
probabilities.

Lemma D.7. Fix any version of the regular conditional probabilities

$$\mu_{X,Y} := \mathbf{P}(X_0, \ldots, X_k \in \,\cdot\, | X_0, Y_1, \ldots, Y_k), \qquad \nu_Y := \mathbf{P}(X_0, \ldots, X_k \in \,\cdot\, | Y_1, \ldots, Y_k).$$

Then there is a set $A$ with $\mathbf{P}((X,Y) \in A) = 1$ such that for every $(x,y) \in A$

Tx,y(Xf  =  1\{X?  :  (r,w)  ±  (M)D  =  vy[Xf  =  1\{X?  :  (r,w)  ±  (£,v)})  = 

emxFx+yixi+i+n+1xii+1 +$r1^r1i 


emxFi+vlxve+ l+S!+1xs+1+yi- lx*- L}  +  e-mixj_1+yjx^ AyJ+1X-+1+yJ~ lx”- 
for  1  <  £  <  k  and  jjgZ, 


PxM  =  1\{X?  :  (r,w)  ±  (MOD  =  VyW  =  1\{X?  :  (r,w)  ?  (M)D  = 

PP{  r.X^+ylXl+Xyl^xr1} 


^{yixi-.+vtxv Ayr lxvk~ r  +  ^mix^+nxr i+i,rLxi- t 


for $v \in \mathbb{Z}$, and $\mu_{x,y}(X_0^w = 1) = \mathbf{1}_{x_0^w = 1}$ for $w \in \mathbb{Z}$, where $\beta := \log\sqrt{(1-p)/p}$.
Proof. It is an elementary fact that (we use the notation $Y_{1:k} = Y_1, \ldots, Y_k$)

$$\mu_{X,Y}(X_\ell^v = 1 | \{X_r^w : (r,w) \ne (\ell,v)\}) = \mathbf{P}(X_\ell^v = 1 | X_0, Y_{1:k}, \{X_r^w : (r,w) \ne (\ell,v)\}),$$
$$\nu_Y(X_\ell^v = 1 | \{X_r^w : (r,w) \ne (\ell,v)\}) = \mathbf{P}(X_\ell^v = 1 | Y_{1:k}, \{X_r^w : (r,w) \ne (\ell,v)\}),$$

see [63, p. 95-96] or [52, Lemma 3.4]. That each statement in the Lemma holds
for $\mathbf{P}$-a.e. $(x,y)$ can therefore be read off from Lemma D.2. As there are countably
many statements, they can be assumed to hold simultaneously on a set $A$ of unit
measure. □


We  can  now  complete  the  proof  of  filter  stability  for  p  >  p*. 

Proposition D.8. There exists an absolute constant $0 < p^* < 1/2$ such that

$$\bigl|\mathbf{E}(f(X_k^{-m}, \ldots, X_k^m) | X_0, Y_1, \ldots, Y_k) - \mathbf{E}(f(X_k^{-m}, \ldots, X_k^m) | Y_1, \ldots, Y_k)\bigr| \le (8m + 4)\,\|f\|_\infty\, e^{-k}$$

a.s. for every $k, m \ge 1$ and function $f$ whenever $p^* < p < 1/2$.

Proof. We apply Theorem D.6 with $I = \{0, \ldots, k\} \times \mathbb{Z}$ and $\mu = \mu_{x,y}$, $\nu = \nu_y$ as
defined in Lemma D.7. Evidently $b_{(0,v)} \le 1$ and $b_{(\ell,v)} = 0$ for $1 \le \ell \le k$ and $v \in \mathbb{Z}$, so
we have

$$|\mu_{x,y}(f(X_k^{-m}, \ldots, X_k^m)) - \nu_y(f(X_k^{-m}, \ldots, X_k^m))| \le 2\|f\|_\infty \sum_{w=-m}^{m} \sum_{v\in\mathbb{Z}} D_{(k,w)(0,v)}$$

by Theorem D.6 provided that the condition on the matrix $C$ is satisfied.
We proceed to estimate the matrix $C$ using Lemma D.7. Evidently

$$C_{(\ell',v')(\ell,v)} = 0 \quad\text{if } \ell' = 0 \text{ or } |\ell' - \ell| + |v' - v| > 1 \text{ or } \ell = \ell',\ v = v'.$$

On the other hand, note that by Lemma D.7

$$\frac{e^{-4\beta}}{e^{4\beta} + e^{-4\beta}} \le \mu_{x,y}(X_\ell^v = 1 | \{X_r^w : (r,w) \ne (\ell,v)\}) \le \frac{e^{4\beta}}{e^{4\beta} + e^{-4\beta}},$$

so we can estimate

$$C_{ji} \le \tanh(4\beta) < 1 \quad\text{for all } i, j \in I.$$

It follows readily that

$$\|C\|_* := \sup_{j\in I} \sum_{i\in I} e^{\|i-j\|}\, C_{ji} \le 4e\tanh(4\beta).$$

We can now evidently choose $0 < p^* < 1/2$ such that $4e\tanh(4\beta) \le 1/2$ for $p^* < p < 1/2$.
Then the condition of Theorem D.6 is satisfied. Moreover, as $\|\cdot\|_*$ is a matrix
norm

$$\|D\|_* \le \sum_{n=0}^{\infty} \|C\|_*^n \le 2.$$

Thus we obtain

$$|\mu_{x,y}(f(X_k^{-m}, \ldots, X_k^m)) - \nu_y(f(X_k^{-m}, \ldots, X_k^m))|$$
$$\le (4m + 2)\,\|f\|_\infty\, e^{-k} \max_{w=-m,\ldots,m} \sum_{v\in\mathbb{Z}} e^{\|(k,w)-(0,v)\|}\, D_{(k,w)(0,v)}$$
$$\le (4m + 2)\,\|D\|_*\,\|f\|_\infty\, e^{-k} \le (8m + 4)\,\|f\|_\infty\, e^{-k}.$$

As our estimates are valid for $\mathbf{P}$-a.e. $(x,y)$, the proof is complete. □
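The condition $4e\tanh(4\beta) \le 1/2$ can be solved explicitly for $p$, since $\beta = \tfrac{1}{2}\log((1-p)/p)$. The sketch below (hypothetical function names) computes the smallest noise level for which this particular sufficient condition holds. It is one admissible value of $p^*$ for the argument above and should not be read as the sharp stability threshold.

```python
import math

def beta_of_p(p):
    """beta = log sqrt((1 - p) / p)."""
    return 0.5 * math.log((1.0 - p) / p)

def weighted_norm_bound(p):
    """The bound 4 e tanh(4 beta) on the weighted norm of the matrix C."""
    return 4.0 * math.e * math.tanh(4.0 * beta_of_p(p))

def high_noise_threshold():
    """Solve 4 e tanh(4 beta) = 1/2 for beta, then invert beta = log sqrt((1 - p) / p)."""
    beta = math.atanh(1.0 / (8.0 * math.e)) / 4.0
    return 1.0 / (1.0 + math.exp(2.0 * beta))
```

The resulting value is only slightly below $1/2$ (roughly $0.494$), so this form of the Dobrushin condition certifies stability only for observations that are nearly pure noise.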